Runtime adaptable search processor

ABSTRACT

A runtime adaptable search processor is disclosed. The search processor provides high speed content search capability to meet the performance needs of network line rates growing to 1 Gbps, 10 Gbps and higher. The search processor provides a unique combination of NFA and DFA based search engines that can process incoming data in parallel to perform the search against the specific rules programmed in the search engines. The processor architecture also provides capabilities to transport and process Internet Protocol (IP) packets from Layer 2 through transport protocol layer and may also provide packet inspection through Layer 7. Further, a runtime adaptable processor is coupled to the protocol processing hardware and may be dynamically adapted to perform hardware tasks as per the needs of the network traffic being sent or received and/or the policies programmed or services or applications being supported. A set of engines may perform pass-through packet classification, policy processing and/or security processing enabling packet streaming through the architecture at nearly the full line rate. A high performance content search and rules processing security processor is disclosed which may be used for application layer and network layer security. A scheduler schedules packets to packet processors for processing. An internal memory or local session database cache stores a session information database for a certain number of active sessions. The session information that is not in the internal memory is stored in and retrieved from an additional memory. An application running on an initiator or target can in certain instantiations register a region of memory, which is made available to its peer(s) for access directly without substantial host intervention through RDMA data transfer.
A security system is also disclosed that enables a new way of implementing security capabilities inside enterprise networks in a distributed manner using a protocol processing hardware with appropriate security features.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a divisional application of U.S. patent application Ser. No. 11/323,165, filed 30 Dec. 2005, which is hereby incorporated by reference in its entirety.

BACKGROUND OF THE INVENTION

This invention relates generally to content search, storage and networking semiconductors and in particular to high performance content search, network storage and security processors that can be used within networking, storage, security, bioinformatics, chipsets, servers, search engines and the like.

Many modern applications depend on fast information search and retrieval. With the advent of the world-wide-web and the phenomenal growth in its usage, content search has become a critical capability. A large number of servers are deployed in web search applications due to the performance limitations of state of the art microprocessors for regular expression driven search.

Significant research and development resources have been devoted to the topic of searching for lexical information or patterns in strings. Regular expressions have been used extensively since the mid 1950s to describe the patterns in strings for content search, lexical analysis, information retrieval systems and the like. Regular expressions were first studied by S. C. Kleene in the mid-1950s to describe the events of nervous activity. It is well understood in the industry that a regular expression (RE) can also be represented using finite state automata (FSA). Non-deterministic finite state automaton (NFA) and deterministic finite state automaton (DFA) are two types of FSAs that have been used extensively over the history of computing. Rabin and Scott were the first to show, in 1959, the equivalence of DFA and NFA as far as their ability to recognize languages. In general a significant body of research exists on regular expressions. Theory of regular expressions can be found in “Introduction to Automata Theory, Languages and Computation” by Hopcroft and Ullman and a significant discussion of the topics can also be found in the book “Compilers: Principles, Techniques and Tools” by Aho, Sethi and Ullman.

Internet protocol (IP) is the most prevalent networking protocol deployed across various networks like local area networks (LANs), metro area networks (MANs) and wide area networks (WANs). Storage area networks (SANs) are predominantly based on Fibre Channel (FC) technology. There is a need to create IP based storage networks.

When transporting block storage traffic on IP designed to transport data streams, the data streams are transported using Transmission Control Protocol (TCP) that is layered to run on top of IP. TCP/IP is a reliable connection/session oriented protocol implemented in software within the operating systems. The TCP/IP software stack is too slow to handle the high line rates that will be deployed in the future. Currently, a 1 GHz processor based server running a TCP/IP stack, with a 1 Gbps network connection, would use 50-70% or more of the processor cycles, leaving minimal cycles available for the processor to allocate to the applications that run on the server. This overhead is not tolerable when transporting storage data over TCP/IP as well as for high performance IP networks. Hence, new hardware solutions would accelerate the TCP/IP stack to carry storage and network data traffic and be competitive with FC based solutions. In addition to the TCP protocol, other protocols such as SCTP and UDP can be used, as well as other protocols appropriate for transporting data streams.

Computers are increasingly networked within enterprises and around the world. These networked computers are changing the paradigm of information management and security. Vast amounts of information, including highly confidential, personal and sensitive information, are now being generated, accessed and stored over the network, and this information needs to be protected from unauthorized access. Further, there is a continuous onslaught of spam, viruses, and other inappropriate content on the users through email, web access, instant messaging, web download and other means, resulting in significant loss of productivity and resources.

Enterprise and service provider networks are rapidly evolving from 10/100 Mbps line rates to 1 Gbps, 10 Gbps and higher line rates. The traditional model of perimeter security to protect information systems poses many issues due to the blurring boundary of an organization's perimeter. Today, as employees, contractors, remote users, partners and customers require access to enterprise networks from outside, a perimeter security model is inadequate. This usage model poses serious security vulnerabilities to critical information and computing resources for these organizations. Thus the traditional model of perimeter security has to be bolstered with security at the core of the network. Further, the convergence of new sources of threats and high line rate networks is making software based perimeter security inadequate to stop external and internal attacks. There is a clear need for enabling security processing in hardware inside core or end systems, in addition to a perimeter firewall, as one of the prominent means of security to thwart ever increasing security breaches and attacks.

The FBI and other leading research institutions have reported in recent years that over 70% of intrusions in organizations have been internal. Hence a perimeter defense relying on protecting an organization from external attacks is not sufficient, as discussed above. Organizations are also required to screen outbound traffic to prevent accidental or malicious disclosure of proprietary and confidential information as well as to prevent their network resources from being used to proliferate spam, viruses, worms and other malware. There is a clear need to inspect the data payloads of the network traffic to protect and secure an organization's network for inbound and outbound security.

Data transported using TCP/IP or other protocols is processed at the source, the destination or intermediate systems in the network or a combination thereof to provide data security or other services like secure sockets layer (SSL) for socket layer security, Transport layer security, encryption/decryption, RDMA, RDMA security, application layer security, virtualization or higher application layer processing, which may further involve application level protocol processing (for example, protocol processing for HTTP, HTTPS, XML, SGML, Secure XML, other XML derivatives, Telnet, FTP, IP Storage, NFS, CIFS, DAFS, and the like). Many of these processing tasks put a significant burden on the host processor that can have a direct impact on the performance of applications and the hardware system. Hence, some of these tasks need to be accelerated using dedicated hardware, for example SSL or TLS acceleration. As the usage of XML increases for web applications, it is expected to put a significant performance burden on the host processor and would also benefit significantly from hardware acceleration. Detection of spam, viruses and other inappropriate content requires deep packet inspection and analysis. Such tasks can put a huge processing burden on the host processor and can substantially lower the network line rate. Hence, deep packet content search and analysis hardware is also required.

The Internet has become an essential tool for doing business at small to large organizations. The HTML based static web has been transformed into a dynamic environment over the last several years with the deployment of XML based services. XML is becoming the lingua-franca of the web and its usage is expected to increase substantially. XML is a descriptive language that offers many advantages by making the documents self-describing for automated processing but is also known to cause huge performance overhead for best of class server processors. Decisions can be made by processing the intelligence embedded in XML documents to enable business to business transactions as well as other information exchange. However, due to the performance overload on the best of class server processors from analyzing XML documents, they cannot be used in systems that require network line rate XML processing to provide intelligent networking. There is a clear need for acceleration solutions for XML document parsing and content inspection at network line rates which are approaching 1 Gbps and 10 Gbps, to realize the benefits of a dynamic web based on XML services.

Regular expressions can be used to represent the content search strings for a variety of applications like those discussed above. A set of regular expressions can then form a rule set for searching for a specific application and can be applied to any document or stream of data for examination of the same. Regular expressions are used in describing anti-spam rules, anti-virus rules, XML document search constructs and the like. These expressions get converted into NFAs or DFAs for evaluation on a general purpose processor. However, significant performance and storage limitations arise for each type of representation. For example, an N character regular expression can take up to the order of 2^(N) memory for the states of a DFA, while the same for an NFA is in the order of N. On the other hand, the evaluation of an M byte input data stream on modern microprocessors takes in the order of M memory accesses for the DFA representation and in the order of (N*M) processor cycles for the NFA representation.
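The storage trade-off described above can be illustrated with a small software sketch. It is only an illustration and is not part of the disclosed hardware: it uses the textbook language "the n-th symbol from the end is 'a'", whose NFA needs n+1 states while the DFA obtained by subset construction needs 2^n states. The function names `nth_from_end_nfa` and `subset_construction` are hypothetical names chosen for this example.

```python
from itertools import chain


def nth_from_end_nfa(n):
    """NFA over {a, b} accepting strings whose n-th symbol from the end is 'a'.

    States are 0..n. State 0 loops on every symbol and, on 'a', also
    guesses that this 'a' is the n-th symbol from the end. State n accepts.
    Returns (transition table, start states, accepting states).
    """
    delta = {(0, "a"): {0, 1}, (0, "b"): {0}}
    for s in range(1, n):
        delta[(s, "a")] = {s + 1}
        delta[(s, "b")] = {s + 1}
    return delta, {0}, {n}


def subset_construction(delta, start, alphabet="ab"):
    """Classic subset construction; returns the set of reachable DFA states."""
    start = frozenset(start)
    seen = {start}
    work = [start]
    while work:
        current = work.pop()
        for ch in alphabet:
            nxt = frozenset(
                chain.from_iterable(delta.get((s, ch), ()) for s in current)
            )
            if nxt not in seen:
                seen.add(nxt)
                work.append(nxt)
    return seen


for n in (2, 4, 8):
    delta, start, _ = nth_from_end_nfa(n)
    dfa_states = subset_construction(delta, start)
    print(f"n={n}: NFA states={n + 1}, DFA states={len(dfa_states)}")
```

The NFA grows linearly with n, while the reachable DFA state count doubles with each additional symbol. Many practical expressions do not hit this worst case.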

When the number of regular expressions increases, the performance deteriorates as well. For example, in an application like anti-spam, there may be hundreds of regular expression rules. These regular expressions can be evaluated on the server processors using individual NFAs or DFAs. It may also be possible to create a composite DFA to represent the rules. Assuming that there are X REs for an application, then a DFA based representation of each individual RE would result in up to the order of (X*2^(N)) states; however, the evaluation time would grow up to the order of (X*N) memory cycles. Generally, due to the potential expansion in the number of states for a DFA, they would need to be stored in off chip memories. Using a typical access time latency of main memory systems of 100 ns, it would require about (X*100 ns*N*M) time to process an X RE DFA with N states over an M byte data stream. This can result in tens of Mbps performance for modest size of X, N & M. Such performance is obviously significantly below the needs of today's network line rates of 1 Gbps to 10 Gbps. On the other hand, if a composite DFA is created, it can result in an upper bound of storage in the order of 2^(N*X), which may not be within physical limits of memory size for typical commercial computing systems even for a few hundred REs. Thus the upper bound in memory expansion for DFAs can be a significant issue. NFAs, on the other hand, are non-deterministic in nature and can result in multiple state transitions that can happen simultaneously. NFAs can only be processed on a state of the art microprocessor in a scalar fashion, resulting in multiple executions of the NFA for each of the enabled paths. X REs with N characters on average can be represented in the upper bound of (X*N) states as NFAs. However, each NFA would require M iterations for an M-byte stream, causing an upper bound of (X*N*M*processor cycles per loop). Assuming the number of processing cycles per loop is in the order of 10 cycles, then for a best of class processor at 4 GHz, the processing time can be around (X*N*M*2.5 ns), which for a nominal N of 8 and X in tens can result in below 100 Mbps performance. There is a clear need to create high performance regular expression based content search processors which can provide the performance in line with the network rates which are going to 1 Gbps and 10 Gbps.
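The software NFA estimate above can be checked with a short calculation. The sketch below simply evaluates the (X*N*M*cycles per loop) bound from the text for one byte of input and converts it to a line rate. The function name and the default parameter values (10 cycles per loop, 4 GHz clock) are taken from the figures quoted in the paragraph and are not measurements.

```python
def nfa_software_throughput_mbps(num_res, avg_chars, cycles_per_loop=10,
                                 clock_hz=4e9):
    """Upper-bound style estimate of scalar NFA evaluation throughput.

    Per input byte, the bound in the text is
    X * N * (cycles per loop) processor cycles.
    """
    seconds_per_byte = num_res * avg_chars * cycles_per_loop / clock_hz
    bits_per_second = 8 / seconds_per_byte
    return bits_per_second / 1e6


# N = 8 characters on average, X in the tens, as in the text.
for x in (10, 20, 50):
    print(f"X={x}: about {nfa_software_throughput_mbps(x, 8):.0f} Mbps")
```

With N of 8, ten rules give roughly 40 Mbps and fifty rules roughly 8 Mbps, which is consistent with the "below 100 Mbps" figure stated above.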

The methods for converting a regular expression to NFA and DFA are well known. The resulting automata are able to distinguish whether a string belongs to the language defined by the regular expression; however, they are not very efficient at determining whether a specific sub-expression of a regular expression is in a matching string, or the extent of the string. Tagged NFAs enable such queries to be conducted efficiently without having to scan the matching string again. For a discussion on Tagged NFA please refer to the paper “NFAs with Tagged Transitions, their Conversion to Deterministic Automata and Application to Regular Expressions”, by Ville Laurikari, Helsinki University of Technology, Finland.

U.S. Patent Applications 20040059443 and 20050012521 describe a method and apparatus for efficient implementation and evaluation of state machines and programmable finite state automata. These applications show an apparatus that is used to evaluate regular expressions using an array of NFAs to create high performance processing of regular expressions. The applications recognize the upper bound in the storage issues for DFAs as a reason to implement regular expressions using NFAs. However, the applications fail to recognize that even though the DFA worst case storage requirement is substantially higher compared to NFAs, many DFAs have less storage needs than NFAs. DFAs for many regular expressions can result in a lower number of states compared to an NFA. For example, in an anti-spam application based on the open source tool SpamAssassin, a large number of the regular expression rules result in DFAs which are smaller than NFAs. Hence, it is important not to ignore DFA implementation based only on the worst case scenario. These patent applications also create NFA engines that process a single RE per NFA block. Thus if an RE uses fewer states than the minimum states of the NFA block, there is no provision to be able to use multiple REs simultaneously in the same block. In my invention, I describe a content search processor which uses an array of runtime adaptable search engines, where the search engines may be runtime adaptable DFA search engines or runtime adaptable NFA search engines or a combination thereof to evaluate regular expressions. The content search engine of my search processor also provides the flexibility of using multiple REs per NFA or DFA engine. My invention also provides capabilities to support Tagged NFA implementations which are not supported or discussed in these applications. Further, these applications do not address the need of dynamically configuring the hardware or the rules being applied based on the transported data being sent to or received from a network. The processors of my invention can be dynamically adapted to apply hardware based rule sets dependent on the transported data, which is not described in the above applications. Further, my invention shows that certain DFAs can be more hardware resource efficient to implement compared to NFAs and can enable today's state of the art FPGAs to implement a large number of regular expressions without having to devote large investments in creating application specific integrated circuits using advanced process technologies. This is also specifically discussed as not feasible to do in the above applications. My invention also shows that content search acceleration can be used to improve application acceleration through a content search application programmer interface (API) and the search processor of this invention.

Hardware acceleration for each type of network data payload can be expensive when a specialized accelerator is deployed for each individual type of network data. There is a clear need for a processor architecture that can adapt itself to the needs of the network data, providing the necessary acceleration and thereby reducing the impact on the host performance. This patent describes such a novel architecture which adapts itself to the needs of the network data. The processor of this patent can be reused and adapted for differing needs of the different types of the payload and still offer the benefits of hardware acceleration. This can significantly reduce the cost of deploying acceleration solutions compared to dedicated application-specific accelerators.

Dynamically reconfigurable computing has been an area that has received significant research and development interest to address the need of reconfiguring hardware resources to suit application needs. The primary focus of the research has been towards creating general purpose microprocessor alternatives that can be adapted with new instruction execution resources to suit application needs.

Field programmable gate arrays (FPGA) have evolved from simple AND-OR logic blocks to more complex elements that provide a large number of programmable logic blocks and programmable routing resources to connect these together or to Input/Output blocks. U.S. Pat. No. 5,600,845 describes an integrated circuit computing device comprising a dynamically configurable FPGA. The gate array is configured to create a RISC processor with a configurable instruction execution unit. This dynamic re-configurability allows the dynamically reconfigurable instruction execution unit to be changed to implement operations in hardware which may be time consuming to run in software. Such an arrangement requires a preconfigured instruction set to execute the incoming instruction and if an instruction is not present it has to be treated as an exception which then has a significant processing overhead. The invention in U.S. Pat. No. 5,600,845 addresses the limitation of general purpose microprocessors but does not address the need of dynamically configuring the hardware based on the transported data being sent to or received from a network.

U.S. Patent Application number 20030097546 describes a reconfigurable processor which receives an instruction stream that is inspected by an instruction test module to decide if the instruction is supported by existing non reconfigurable hardware or the reconfigurable hardware configured by a software routine and executes the instruction stream based on the test result. If the instruction is not supported then the processor decides a course of action to be taken including executing the instruction stream in software. The patent application number 20030097546 also does not address the need of dynamically configuring the hardware based on the transported data being sent to or received from a network.

U.S. Patent Application number 20040019765 describes a pipelined reconfigurable dynamic instruction set processor. In that application, dynamically reconfigurable pipeline stages under control of a microcontroller are described. This is yet another dynamically reconfigurable processor that can adapt its pipeline stages and their interconnections based on the instructions being processed as an alternative to general purpose microprocessors.

The field of reconfigurable computing has been rich with research towards creating dynamically reconfigurable logic devices, either as FPGAs or reconfigurable processors as described above, primarily addressing the limitations of general purpose processors by adding reconfigurable execution units or reconfigurable coprocessors. For example, “Reconfigurable FPGA processor”, a diploma thesis paper by Andreas Romer from the Swiss Federal Institute of Technology, targets the need of creating ASIC-like performance and area, but with general purpose processor level flexibility, by dynamically creating execution functional units in a reconfigurable part of a reconfigurable FPGA like Xilinx Virtex and XC6200 devices. Similarly, the paper by J. R. Hauser and J. Wawrzynek entitled “Garp: A MIPS Processor With a Reconfigurable Coprocessor”, published in Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM '97), targets the need for creating custom co-processing support for a MIPS processor, addressing the limitations of the general purpose processing capabilities of the MIPS processor.

Published research or patent applications have not addressed the need of dynamically configuring the hardware based on transported data as well as actions to be taken and applications/services to be deployed for that specific data being sent to or received from a network. This patent describes a novel architecture which adapts itself to the needs of the network data and is run-time adaptable to perform time consuming security policy operations or application/services or other data processing needs of the transported data and defined policies of the system incorporating this invention. The architecture also comprises a deep packet inspection engine that may be used for detecting spam, viruses, digital rights management information, instant message inspection, URL matching, application detection, malicious content, and other content and applying specific rules which may enable anti-spam, anti-virus and the like capabilities.

SUMMARY OF THE INVENTION

I describe a high performance run time adaptable search processor using hardware acceleration for regular expressions. The regular expressions are converted into equivalent DFAs and NFAs and then the most cost effective solution is chosen. The content search processor comprises a set of DFA processing engines and NFA processing engines. The converted REs are then mapped either to an appropriate NFA or DFA engine. The processor may also include a set of composite DFA processing engines which can be used to absorb the growth in the number of rules beyond the number supported by the array of FSA engines. Thus new system hardware may not be necessary until the composite DFA results in significant memory usage which is beyond that available with the memory associated with the content search processor.
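The selection step described above can be sketched in software as follows. This is a minimal sketch under stated assumptions, not the disclosed compiler flow: it assumes that "cost" is simply the state count of each representation, that each engine holds one rule of at most `max_states` states, and that rules which cannot be placed fall through to a composite DFA. The function name `assign_engines` and the rule names and state counts in the usage example are hypothetical.

```python
def assign_engines(rules, nfa_engines, dfa_engines, max_states=16):
    """Map each rule to an NFA engine, a DFA engine, or the composite DFA.

    rules: dict of rule name -> (nfa_state_count, dfa_state_count).
    The cheaper representation (fewer states, an assumed cost metric)
    is tried first; the other one is tried if no engine of the
    preferred kind is free or the automaton does not fit.
    """
    free = {"NFA": nfa_engines, "DFA": dfa_engines}
    placement = {}
    for name, (nfa_states, dfa_states) in rules.items():
        cost = {"NFA": nfa_states, "DFA": dfa_states}
        order = ["DFA", "NFA"] if dfa_states <= nfa_states else ["NFA", "DFA"]
        for kind in order:
            if free[kind] > 0 and cost[kind] <= max_states:
                free[kind] -= 1
                placement[name] = kind
                break
        else:
            # No individual engine available: absorb into the composite DFA.
            placement[name] = "COMPOSITE_DFA"
    return placement


# Hypothetical rule set with made-up state counts.
example = {
    "rule_a": (9, 6),     # DFA is smaller
    "rule_b": (7, 30),    # DFA too large for one engine, NFA fits
    "rule_c": (12, 10),   # prefers DFA, but none is left in this setup
    "rule_d": (40, 200),  # fits neither kind of engine
}
print(assign_engines(example, nfa_engines=2, dfa_engines=1))
```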

In many content search applications like security, there is a need to constantly update the rules or the signatures being used to detect malicious traffic. In such applications it is critical that a solution be adaptable to keep up with the constantly evolving nature of the security threat. In an always connected usage model, it is extremely important to have the latest security threat mitigation rules updated in the security system on a frequent basis. When a composite DFA type architecture is used, compiling and releasing any new security rules or policy can consume a large amount of time, so the updates may not be timely enough to avoid the impact of the security threat. In such environments the release of a new rule base may take 8 to 24 hours, which is quite a delayed response to a constantly evolving threat. In the processor of this invention, that issue is addressed since the release of new rules is a matter of converting those rules into NFAs and DFAs and updating only these very small rules into the content search processor. Thus the response to new threats can be immediate and would not require the huge delays which occur from integrating the new rules into the composite rule base and converting those into composite DFAs.

There are several instances of REs which include only a few states. For example, if the content search includes looking for *.exe or *.com or *.html or the like, the NFAs or DFAs for these REs include a small number of states. Thus if all DFA or NFA engines support, say, 16 states, then it may be possible to include multiple rules per engine. This invention enables the maximum utilization of the FSA engines by allowing multiple rules per FSA engine. The engines also provide FSA extension logic to chain the base engines together to create super blocks that can handle larger FSAs.
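The packing idea above can be illustrated with a simple first-fit-decreasing allocation. This is only a software sketch under assumptions: the 16-state engine size comes from the example in the text, the per-rule state counts are made up, and the rule that small REs are not placed into the spare states of a chained super block is a simplification chosen here, not something the text specifies. The name `pack_rules` is hypothetical.

```python
def pack_rules(rule_states, engine_states=16):
    """Pack rules into FSA engines, several small rules per engine.

    rule_states: dict of rule name -> number of FSA states it needs.
    Rules larger than one engine get a "super block" of chained engines.
    Returns a list of allocations, each with the number of base engines
    used ("blocks"), the remaining free states, and the rules placed.
    """
    allocations = []
    # Largest rules first (first-fit decreasing).
    for name, need in sorted(rule_states.items(), key=lambda kv: -kv[1]):
        if need > engine_states:
            blocks = -(-need // engine_states)  # ceiling division
            allocations.append({"blocks": blocks,
                                "free": blocks * engine_states - need,
                                "rules": [name]})
            continue
        for alloc in allocations:
            if alloc["blocks"] == 1 and alloc["free"] >= need:
                alloc["free"] -= need
                alloc["rules"].append(name)
                break
        else:
            allocations.append({"blocks": 1,
                                "free": engine_states - need,
                                "rules": [name]})
    return allocations


# Hypothetical state counts for a few short rules and one large rule.
rules = {"exe": 5, "com": 5, "html": 6, "long_signature": 40}
for alloc in pack_rules(rules):
    print(alloc)
```

In this example the three short rules share one 16-state engine, and the 40-state rule occupies a super block of three chained engines.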

Berry and Sethi, in their paper “From Regular Expressions to Deterministic Automata” published in Theoretical Computer Science in 1986, showed that regular expressions can be represented by NFAs such that a given state in the state machine is entered by one symbol, unlike the Thompson NFA. Further, the Berry-Sethi NFAs are ε-free. A ‘V’ term RE can be represented using a ‘V+1’ state NFA using the Berry-Sethi like NFA realization method. The dual of the Berry-Sethi method also exists, where all transitions that lead the machine out of a state are dependent on the same symbol. This is shown in section 4.3 of the paper “A Taxonomy of finite automata construction algorithms” by Bruce Watson, published in 1994. I show a method of creating an NFA search engine architecture leveraging the principles of Berry-Sethi's NFA realization and the dual of their construct. The NFA search engine is programmable to realize an arbitrary regular expression.
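The properties cited above (V+1 states, no ε transitions, every state entered on a single symbol) can be seen in a small software version of the position automaton construction usually attributed to Glushkov and to Berry and Sethi. The sketch is illustrative only and says nothing about how the hardware engine is organized. The tuple-based expression format and the names `position_nfa` and `matches` are assumptions made for this example.

```python
def position_nfa(ast):
    """Build an epsilon-free position NFA from a small regex AST.

    AST nodes: ("sym", c), ("cat", a, b), ("alt", a, b), ("star", a).
    State 0 is the start state; state i (i >= 1) is entered only on
    symbols[i - 1], so a V-symbol expression yields V + 1 states.
    Returns (symbols, follow, accepting).
    """
    symbols = []
    follow = {}

    def walk(node):
        """Return (nullable, first positions, last positions) of node."""
        kind = node[0]
        if kind == "sym":
            symbols.append(node[1])
            pos = len(symbols)
            follow[pos] = set()
            return False, {pos}, {pos}
        if kind == "star":
            _, first, last = walk(node[1])
            for p in last:
                follow[p] |= first
            return True, first, last
        n1, f1, l1 = walk(node[1])
        n2, f2, l2 = walk(node[2])
        if kind == "alt":
            return n1 or n2, f1 | f2, l1 | l2
        # Concatenation: every last position of the left part is
        # followed by every first position of the right part.
        for p in l1:
            follow[p] |= f2
        return n1 and n2, (f1 | f2) if n1 else f1, (l1 | l2) if n2 else l2

    nullable, first, last = walk(ast)
    follow[0] = set(first)
    accepting = set(last) | ({0} if nullable else set())
    return symbols, follow, accepting


def matches(nfa, text):
    """Simulate the NFA, keeping the set of all active states per symbol."""
    symbols, follow, accepting = nfa
    active = {0}
    for ch in text:
        active = {q for p in active for q in follow[p] if symbols[q - 1] == ch}
    return bool(active & accepting)


# (a|b)*abb has V = 5 symbol occurrences, hence 6 states.
a, b = ("sym", "a"), ("sym", "b")
regex = ("cat", ("cat", ("cat", ("star", ("alt", a, b)), a), b), b)
nfa = position_nfa(regex)
print(len(nfa[0]) + 1, matches(nfa, "aabb"), matches(nfa, "abab"))
```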

In this invention I also show how the content search processor of this invention can be used to create general application acceleration in a compute device like a server, personal computer, workstation or the like. I show an example content search API which can be used as a general facility that may be offered by an operating system of those devices to applications running on them. Such applications can utilize the content search processor and significantly improve their performance compared to running on the general purpose processor of these devices.

An example application of anti-spam is illustrated in this application which can be accelerated to become a high line rate application, unlike current solutions which run on the general purpose processors of compute servers.

I also illustrate an example of using the content search processor coupled with a protocol processor such that the content search processing can be done in line with the traffic and appropriate actions be taken dependent on the traffic content. The processor of this invention can thus be used to apply content specific search rules for a traffic stream. For example, if a specific packet or stream of packets carries SMTP traffic, then the protocol processor can let the content search processor know that and provide the appropriate flow information. Then, the content search processor retrieves the flow context for the current flow from the memory and retrieves the SMTP application rules context from the application memory associated with the search engines. The search engines get configured to process the content of the specific flow and its associated application context. Thus content search can continue across multiple packets of the same flow even when packets in the flow arrive with a significant time gap between them, and multiple different application rule contexts can also be applied. By using such an architecture a significant performance benefit can result compared to architectures where the rule context cannot be changed rapidly, and where the context may need to be brought in from global memory into each of the FSA engines.
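The flow context handling described above can be sketched in software. The sketch below is a simplification made for illustration: each application rule context is a single literal byte pattern matched with a Knuth-Morris-Pratt automaton (substituted here for the FSA engines of the text), and the saved flow context is just the current automaton state. The class name `ContentSearch`, the method `process_packet` and the sample pattern are hypothetical.

```python
def failure_table(pattern):
    """Standard KMP failure function for a byte pattern."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


class ContentSearch:
    """Keeps one rule context per application and one state per flow."""

    def __init__(self, app_rules):
        # Application rule contexts, e.g. {"SMTP": b"..."}.
        self.rules = {app: (pat, failure_table(pat))
                      for app, pat in app_rules.items()}
        # Flow contexts: flow id -> matcher state saved between packets.
        self.flow_state = {}

    def process_packet(self, flow_id, app, payload):
        """Resume the flow's saved state, scan the payload, save state."""
        pattern, fail = self.rules[app]
        k = self.flow_state.get(flow_id, 0)
        hits = 0
        for byte in payload:
            while k and pattern[k] != byte:
                k = fail[k - 1]
            if pattern[k] == byte:
                k += 1
            if k == len(pattern):
                hits += 1
                k = fail[k - 1]
        self.flow_state[flow_id] = k
        return hits


search = ContentSearch({"SMTP": b"free offer"})
print(search.process_packet(1, "SMTP", b"get your free of"))  # match not complete yet
print(search.process_packet(2, "SMTP", b"fer"))               # different flow, no match
print(search.process_packet(1, "SMTP", b"fer today"))         # completes the match in flow 1
```

Because the state is stored per flow, a pattern that straddles two packets of flow 1 is still found even though a packet from flow 2 is processed in between.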

I also describe a high performance hardware processor that sharply reduces the TCP/IP protocol stack overhead from the host processor and enables a high line rate storage and data transport solution based on IP.

This patent also describes the novel high performance processor that sharply reduces the TCP/IP protocol stack overhead from the host processor and enables high line rate security processing including firewall, encryption, decryption, intrusion detection and the like. This patent also describes a content inspection architecture that may be used for detecting spam, viruses, digital rights management information, instant message inspection, URL matching, application detection, malicious content, and other content and applying specific rules which may enable anti-spam, anti-virus and the like capabilities. The content inspection engine may be used for detecting and enforcing digital rights management rules for the content. The content inspection engine may also be used for URL matching, string searches, content based load balancing, sensitive information search like credit card numbers or social security numbers or health information or the like. The content inspection engine results may be used to direct the operation of the run-time adaptable processor as well.

This patent also describes a novel processor architecture that is run-time adaptable to the needs of the data sent to or received from a network. The run-time adaptable features of this processor can be used to deploy services that operate on network data under control of user definable policies. The adaptable processor may also be used to dynamically offload compute intensive operations from the host processor, when not performing operations on the network data or in conjunction with network data processing if enough adaptable hardware resources are available. The processor performs protocol processing like TCP/IP or SCTP or UDP or the like using the high performance protocol processor disclosed and then uses an adaptable processing hardware to provide other functions or services like socket layer security, Transport layer security, encryption/decryption, RDMA, RDMA security, application layer security, content inspection, deep packet inspection, virus scanning or detection, policy processing, content based switching, load balancing, content based load balancing, virtualization or higher application layer processing or a combination thereof. Higher layer processing may further involve application level protocol processing (for example, protocol processing for HTTP, HTTPS, XML, SGML, Secure XML, other XML derivatives, Telnet, FTP, IP Storage, NFS, CIFS, DAFS and the like) which may also be accelerated by dynamically adapting or reconfiguring the processor of this patent. This can significantly reduce the processing overhead on the host processor of the target system, without adding the major system cost of dedicated accelerator hardware.

The processing capabilities of a system deploying the runtime adaptable processor of this patent can continue to expand and improve without the need for continually upgrading the system with a new host processor to achieve performance benefits. The hardware of the processor may comprise computational elements organized into compute clusters. Computational elements may provide logical and arithmetic operations besides other functions. A computational element may operate on 1-bit, 2-bit, 4-bit, 8-bit or n-bit data sizes as may be chosen by the implementation. Thus multiple computational elements may together provide the desired size of operators. For example, if each computational element provides operations on a largest data size of 8-bits, then to operate on 32-bit operands, four computational elements may each operate on a byte of the 32-bit operand. The computational elements within the compute clusters can be programmatically interconnected with each other using a dynamically changeable or adaptable interconnection network. The compute clusters may also be dynamically interconnected with each other programmatically, forming an adaptable network. Thus arbitrary interconnections can be created between the array of computational elements within a compute cluster as well as outside of the compute clusters. The computational elements may each be dynamically changed to provide necessary function(s) or operation(s) by programmatically selecting and connecting the necessary logic blocks through muxes or other logic blocks or other means. The computational elements may also be simple processors with ALU and other functional blocks. In this case, to change the hardware function of the computational element (CE), it is programmatically instructed to execute certain function(s) or operation(s). The operation(s) selected for each of the computational elements can be different. Thus the processor hardware of this patent can be dynamically, i.e. during operation or at runtime, changed or adapted to provide a different functionality.

For explanation purposes let us take an example of a compute cluster with 8 computational elements each providing 8-bit operations. If we want to perform two 32-bit operations like a 32-bit addition followed by an n-bit shift operation, then the computational elements may be grouped into two sets of four each. The first group would be programmed to provide the addition operation, where each of them may operate on 8-bits at a time. The appropriate carry and flags and other outputs would be available through the interconnections between the CEs which may be programmatically selected. The second group of CEs would be programmed to provide a shift operation on the incoming data. One such setup of the CEs may be called an Avatar or a virtual configuration/setup. The CEs may then continue to provide these operations on the input operands for the period of time that this avatar is maintained. Then it is possible to dynamically change the avatar to a new avatar. For instance in the example used above, let us assume that after a certain time period, which may be as small as a clock period or multiple clock periods or other period, the processor needs to switch from providing acceleration support for a 32-bit Add followed by a Shift to something like two 16-bit subtractions followed by two 16-bit logical ANDs. In such an instance the hardware is set up to form four groups of two CEs, each group operating on 16-bit operands. The first four CEs in this case may now be dynamically switched or changed from providing the addition function to the subtraction function. Further, they may now be dynamically switched to operate in two groups to provide 16-bit operations, instead of one group providing a 32-bit operation in the previous avatar. Similarly, the second group of four CEs from the previous avatar may now be dynamically switched or changed to provide logical AND operations and may also be set up as two groups providing 16-bit operations. This forms a new avatar of the hardware which has been dynamically changed as per the need of the required functionality at the time.

Thus the runtime adaptable protocol processor of this patent can change its functions at a fine granularity along with the interconnections of these operators provided by the CEs to form a runtime changeable or adaptable hardware platform. The operations supported may be a lot more complex than those used in the examples discussed above. The examples were provided primarily to provide a better appreciation of the capabilities and were deliberately kept simplistic. Though the examples suggested a unidirectional flow, this is not to be construed as the only mode of operation. The outputs from the operations in the examples above could be recycled to the first group of CEs, which would allow a pipelined loop of the hardware. More complex scenarios are feasible with the dynamically adaptable nature of the CEs and the interconnection network, where different stages of CEs may be switched over a period of time to provide different functionality as may be required by the algorithm or application or service being enabled in hardware. Pipelined stages of operations are thus possible with arbitrary loop backs as necessary. Hence the applications or services being accelerated or supported in hardware can increase over time, where the users may decide to accelerate applications of choice by mapping them appropriately to the runtime adaptable protocol processor as and when necessary or feasible due to cost, performance, resources, application discovery, application development or any other reasons that may cause them to invent or develop new applications or services. The hardware system may thus be adapted to use such applications without the need for incurring costs of buying new systems or accelerators and the like. The system capabilities can be increased over time as new services are developed and/or deployed that exploit the adaptable component of the processor of this invention. The new services or policies or a combination thereof may be deployed to the appropriate systems over a network under user control.
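
The two avatars of the example above can be modeled in software as a minimal sketch. Everything named here (`ce_add`, `ce_sub`, `chained`, `avatar_add_shift`, `avatar_sub_and`) is hypothetical and chosen only for illustration; the sketch assumes each CE is 8 bits wide and that a carry or borrow is passed from one CE to the next through the interconnect. The shift stage is modeled directly on the 32-bit value rather than byte by byte.

```python
def split_bytes(value, n_ces):
    """Split an operand into n_ces bytes, least significant byte first."""
    return [(value >> (8 * i)) & 0xFF for i in range(n_ces)]


def join_bytes(parts):
    """Reassemble bytes (least significant first) into one integer."""
    return sum(b << (8 * i) for i, b in enumerate(parts))


def ce_add(a, b, carry_in):
    """One 8-bit CE programmed as an adder; returns (result, carry_out)."""
    total = a + b + carry_in
    return total & 0xFF, total >> 8


def ce_sub(a, b, borrow_in):
    """One 8-bit CE programmed as a subtractor; returns (result, borrow_out)."""
    diff = a - b - borrow_in
    return diff & 0xFF, 1 if diff < 0 else 0


def chained(op, x, y, n_ces):
    """Group n_ces CEs into one wider operator, linking carry/borrow between them."""
    link = 0
    out = []
    for a, b in zip(split_bytes(x, n_ces), split_bytes(y, n_ces)):
        result, link = op(a, b, link)
        out.append(result)
    return join_bytes(out)


def avatar_add_shift(x, y, shift):
    """Avatar 1: CEs 0-3 form a 32-bit adder, CEs 4-7 a 32-bit left shifter."""
    total = chained(ce_add, x, y, 4)
    return (total << shift) & 0xFFFFFFFF


def avatar_sub_and(pairs, masks):
    """Avatar 2: CEs 0-3 form two 16-bit subtractors, CEs 4-7 two 16-bit ANDs."""
    return [chained(ce_sub, x, y, 2) & m for (x, y), m in zip(pairs, masks)]


if __name__ == "__main__":
    print(hex(avatar_add_shift(0x0000FFFF, 1, 4)))
    print([hex(v) for v in avatar_sub_and([(0x1234, 0x0035)], [0xFFFF])])
```

Switching from `avatar_add_shift` to `avatar_sub_and` corresponds to regrouping the same eight CEs and reprogramming their operations, which in hardware would be a change of interconnect and function selection rather than a function call.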

Traditionally, the TCP/IP networking stack is implemented inside the operating system kernel as a software stack. The software TCP/IP stack implementation consumes, as mentioned above, more than 50% of the processing cycles available in a 1 GHz processor when serving a 1 Gbps network. The overhead comes from various aspects of the software TCP/IP stack including checksum calculation, memory buffer copy, processor interrupts on packet arrival, session establishment, session tear down and other reliable transport services. The software stack overhead becomes prohibitive at higher line rates. Similar issues occur in networks with lower line rates, like wireless networks, that use lower performance host processors. A hardware implementation can remove the overhead from the host processor.

The software TCP/IP networking stack provided by the operating systems uses up a majority of the host processor cycles. TCP/IP is a reliable transport that can be run on unreliable data links. Hence, when a network packet is dropped or has errors, TCP does the retransmission of the packets. The errors in packets are detected using a checksum that is carried within the packet. The recipient of a TCP packet computes the checksum of the received packet and compares that to the received checksum. This is an expensive compute intensive operation performed on each packet involving each received byte in the packet. The packets between a source and destination may arrive out of order and the TCP layer performs ordering of the data stream before presenting it to the upper layers. IP packets may also be fragmented based on the maximum transfer unit (MTU) of the link layer and hence the recipient is expected to de-fragment the packets. These functions result in temporarily storing the out of order packets, fragmented packets or unacknowledged packets in memory on the network card for example. When the line rates increase to above 1 Gbps, the memory size overhead and memory speed bottleneck resulting from these add significant cost to the network cards and also cause huge performance overhead. Another function that consumes a lot of processor resources is the copying of the data to/from the network card buffers, kernel buffers and the application buffers.
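
To make the per-byte cost of the checksum concrete, the following is a minimal software sketch of the 16-bit one's complement Internet checksum as specified in RFC 1071. The function name `internet_checksum` is chosen here for illustration. A real TCP checksum is computed over a pseudo-header together with the TCP header and payload; this sketch assumes the caller has already assembled those bytes.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's complement checksum over data (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        # Every received byte is touched once: this loop is the per-packet cost.
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        # Fold carries out of the top 16 bits back into the low 16 bits.
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


if __name__ == "__main__":
    segment = bytes.fromhex("0001f203f4f5f6f7")
    print(hex(internet_checksum(segment)))
```

A receiver can verify a segment by running the same routine over the data with the transmitted checksum included; a result of zero indicates that no error was detected.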

Microprocessors are increasingly achieving their high performance and speed using deep pipelining and super scalar architectures. Interrupting these processors on arrival of small packets will cause severe performance degradation due to context switching overhead, pipeline flushes and refilling of the pipelines. Hence interrupting the processors should be minimized to the most essential interrupts only. When the block storage traffic is transported over TCP/IP networks, these performance issues become critical, severely impacting the throughput and the latency of the storage traffic. Hence the processor intervention in the entire process of transporting storage traffic needs to be minimized for IP based storage solutions to have performance and latency comparable to other specialized network architectures like fibre channel, which are specified with a view to a hardware implementation. Emerging IP based storage standards like iSCSI, FCIP, iFCP, and others (like NFS, CIFS, DAFS, HTTP, XML, XML derivatives (such as Voice XML, EBXML, Microsoft SOAP and others), SGML, and HTML formats) encapsulate the storage and data traffic in TCP/IP segments. However, there is usually no alignment relationship between the TCP segments and the protocol data units that are encapsulated by TCP packets. This becomes an issue when the packets arrive out of order, which is a very frequent event in today's networks. The storage and data blocks cannot be extracted from the out of order packets for use until the intermediate packets in the stream arrive, which will cause the network adapters to store these packets in the memory, retrieve them and order them when the intermediate packets arrive. This can be expensive in terms of the size of the memory storage required and also the performance that the memory subsystem is expected to support, particularly at line rates above 1 Gbps. This overhead can be removed if each TCP segment can uniquely identify the protocol data unit and its sequence. This can allow the packets to be directly transferred to their end memory location in the host system. Host processor intervention should also be minimized in the transfer of large blocks of data that may be transferred to the storage subsystems or shared with other processors in a clustering environment or other client server environment. The processor should be interrupted only on storage command boundaries to minimize the impact.
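
The buffering cost of out of order arrival can be illustrated with a small sketch. The `Reassembler` class below is hypothetical and deliberately simplified: it assumes byte sequence numbers that do not wrap and segments that do not overlap, and it only shows that later segments must be held in memory until the missing intermediate segment arrives.

```python
class Reassembler:
    """Holds out of order segments until the stream can be delivered in order."""

    def __init__(self, initial_seq: int = 0):
        self.next_seq = initial_seq  # next byte sequence number expected
        self.pending = {}            # seq -> payload waiting for a gap to close

    def receive(self, seq: int, payload: bytes) -> bytes:
        """Accept one segment; return whatever bytes are now deliverable in order."""
        self.pending[seq] = payload
        delivered = bytearray()
        while self.next_seq in self.pending:
            chunk = self.pending.pop(self.next_seq)
            delivered += chunk
            self.next_seq += len(chunk)
        return bytes(delivered)

    def buffered_bytes(self) -> int:
        """Memory currently tied up by segments that cannot yet be delivered."""
        return sum(len(p) for p in self.pending.values())


if __name__ == "__main__":
    r = Reassembler()
    print(r.receive(4, b"efgh"), r.buffered_bytes())  # held: gap at 0..3
    print(r.receive(0, b"abcd"), r.buffered_bytes())  # gap closed, all delivered
```

If each segment carried enough information to identify its protocol data unit and offset, the `pending` store could be bypassed and the payload placed directly at its final location, which is the improvement the paragraph above describes.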

The IP processor set forth herein eliminates or sharply reduces the effect of various issues outlined above through innovative architectural features and the design. The described processor architecture provides features to terminate the TCP traffic carrying the storage and data payload, thereby eliminating or sharply reducing the TCP/IP networking stack overhead on the host processor, resulting in a packet streaming architecture that allows packets to pass through from input to output with minimal latency. Enabling high line rate storage or data traffic carried over IP requires maintaining the transmission control block information for various connections (sessions), which is traditionally maintained by host kernel or driver software. As used in this patent, the term “IP session” means a session for a session oriented protocol that runs on IP. Examples are TCP/IP, SCTP/IP, and the like. Accessing session information for each packet adds significant processing overhead. The described architecture creates a high performance memory subsystem that significantly reduces this overhead. The architecture of the processor provides capabilities for intelligent flow control that minimizes interrupts to the host processor primarily at the command or data transfer completion boundary.
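
One way to picture a session memory subsystem of this kind is a small fast cache of active session entries backed by a larger, slower store. The sketch below is an illustration under stated assumptions, not the disclosed design: the `SessionCache` class, the least-recently-used eviction policy and the two-field session entry are all hypothetical.

```python
from collections import OrderedDict


class SessionCache:
    """Small cache of active session entries with a larger backing store."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.cache = OrderedDict()  # fast local entries, most recently used last
        self.backing = {}           # stands in for additional (slower) memory
        self.misses = 0

    def lookup(self, session_id):
        """Return the session entry, fetching or creating it on a cache miss."""
        if session_id in self.cache:
            self.cache.move_to_end(session_id)
            return self.cache[session_id]
        self.misses += 1
        entry = self.backing.pop(session_id, None)
        if entry is None:
            # Hypothetical, heavily reduced transmission control block.
            entry = {"snd_nxt": 0, "rcv_nxt": 0}
        self.cache[session_id] = entry
        if len(self.cache) > self.capacity:
            # Assumed policy: write the least recently used entry back.
            old_id, old_entry = self.cache.popitem(last=False)
            self.backing[old_id] = old_entry
        return entry


if __name__ == "__main__":
    sessions = SessionCache(capacity=2)
    sessions.lookup(("10.0.0.1", 3260, "10.0.0.2", 40000))["rcv_nxt"] = 100
    print(sessions.misses)
```

Packets belonging to a session that is already cached avoid the slower lookup entirely, which is where the per-packet overhead reduction would come from.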

Today, no TCP/IP processor is offered with security.

The conventional network security model deployed today involves perimeter security in the form of perimeter firewall and intrusion detection systems. However, as an increasing amount of business is conducted on-line, there is a need to provide enterprise network access to “trusted insiders”: employees, partners, customers and contractors from outside. This creates potential threats to the information assets inside an enterprise network. Recent research by leading firms and the FBI found that over 70 percent of the unauthorized access to information systems is committed by employees or trusted insiders and so are over 95 percent of intrusions that result in substantial financial loss. In an environment where remote access servers, peer networks with partners, VPN and wireless access points blur the boundary of the network, perimeter security alone is not sufficient. In such an environment organizations need to adopt an integrated strategy that addresses network security at all tiers including at the perimeter, gateways, servers, switches, routers and clients instead of using point security products at the perimeter.

Traditional firewalls provide perimeter security at network layers by keeping offending IP addresses out of the internal network. However, because many new attacks arrive as viruses or spam, exploiting known vulnerabilities of well-known software and higher level protocols, it is desirable to develop and deploy application layer firewalls. These should also be distributed across the network instead of being primarily at the perimeter.

Currently, because TCP/IP processing exists as a software stack in clients, servers and other core and end systems, security processing is also done in software, particularly capabilities like firewall, intrusion detection and prevention. As the line rates of these networks go to 1 Gbps and 10 Gbps, it is imperative that the TCP/IP protocol stack be implemented in hardware because a software stack consumes a large portion of the available host processor cycles. Similarly, if the security processing functions get deployed on core or end systems instead of being deployed only at the perimeter, the processing power required to perform these operations may create a huge overhead on the host processor of these systems. Hence software based distributed security processing would increase the required processing capability of the system and increase the cost of deploying such a solution. A software based implementation would be detrimental to the performance of the servers and significantly increase the delay or latency of the server response to clients and may limit the number of clients that can be served. Further, if the host system software stack gets compromised during a network attack, it may not be possible to isolate the security functions, thereby compromising network security. Further, as the TCP/IP protocol processing comes to be done in hardware, the software network layer firewalls may not have access to all state information needed to perform the security functions. Hence, the protocol processing hardware may be required to provide access to the protocol layer information that it processes and the host may have to redo some of the functions to meet the network firewall needs.

The hardware based TCP/IP and security rules processing processor of this patent solves the distributed core security processing bottleneck besides solving the performance bottleneck from the TCP/IP protocol stack. The hardware processor of this patent sharply reduces the TCP/IP protocol stack processing overhead from the host CPU and enables security processing features like firewall at various protocol layers such as link, network and transport layers, thereby substantially improving the host CPU performance for intended applications. Further, this processor provides capabilities that can be used to perform deep packet inspection to perform higher layer security functions using the programmable processor and the classification/policy engines disclosed. This patent also describes a content inspection architecture that may be used for detecting spam, viruses, digital rights management information, instant message inspection, URL matching, application detection, malicious content, and other content and applying specific rules which may enable anti-spam, anti-virus and the like security and content inspection and processing capabilities. The processor of this patent thus enables hardware TCP/IP and security processing at all layers of the OSI stack to implement capabilities like firewall at all layers including the network layer and application layers.

The processor architecture of this patent also provides integrated advanced security features. This processor allows for in-stream encryption and decryption of the network traffic on a packet by packet basis thereby allowing high line rates and at the same time offering confidentiality of the data traffic. Similarly, when the storage traffic is carried on a network from the server to the storage arrays in a SAN or other storage system, it is exposed to various security vulnerabilities that a direct attached storage system does not have to deal with. This processor allows for in-stream encryption and decryption of the storage traffic thereby allowing high line rates and at the same time offering confidentiality of the storage data traffic.

Classification of network traffic is another task that consumes up to half of the processing cycles available on packet processors leaving few cycles for deep packet inspection and processing. IP based storage traffic by the nature of the protocol requires high speed low latency deep packet processing. The described IP processor significantly reduces the classification overhead by providing a programmable classification engine. The programmable classification engine of this patent allows deployment of advanced security policies that can be enforced on a per packet, per transaction, and per flow basis. This will result in significant improvement in deploying distributed enterprise security solutions in a high performance and cost effective manner to address the emerging security threats from within the organizations.
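
As a rough software analogue of per packet policy enforcement, the sketch below evaluates an ordered rule list against a few header fields and returns the action of the first matching rule. The `Rule` fields, the `classify` function, the first-match semantics and the default action are assumptions made for illustration; the classification engine itself is described later in this patent. Port 3260 is used in the usage example because it is the registered iSCSI port.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import List, Optional


@dataclass
class Rule:
    src: str                  # source network in CIDR notation
    dst_port: Optional[int]   # None matches any destination port
    protocol: Optional[str]   # None matches any protocol
    action: str               # e.g. "permit" or "deny"


def classify(rules: List[Rule], src_ip: str, dst_port: int, protocol: str,
             default: str = "deny") -> str:
    """Return the action of the first rule matching the packet header fields."""
    for rule in rules:
        if ip_address(src_ip) not in ip_network(rule.src):
            continue
        if rule.dst_port is not None and rule.dst_port != dst_port:
            continue
        if rule.protocol is not None and rule.protocol != protocol:
            continue
        return rule.action
    return default


if __name__ == "__main__":
    policy = [
        Rule("10.0.0.0/8", 3260, "tcp", "permit"),  # internal hosts may reach iSCSI
        Rule("0.0.0.0/0", None, None, "deny"),      # everything else is dropped
    ]
    print(classify(policy, "10.1.2.3", 3260, "tcp"))
```

A software loop like this costs time proportional to the number of rules for every packet, which is the kind of overhead a dedicated classification engine is intended to remove.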

To enable the creation of distributed security solutions, it is critical to address the need of Information Technology managers to cost effectively manage the entire network. Addition of distributed security without a means of managing it easily can significantly increase the management cost of the network. The disclosure of this patent also provides a security rules/policy management capability that can be used by IT personnel to distribute the security rules from a centralized location to various internal network systems that use the processor of this patent. The processor comprises hardware and software capabilities that can interact with centralized rules management system(s). Thus the distribution of the security rules and collection of information of compliance or violation of the rules or other related information like offending systems, users and the like can be processed from one or more centralized locations by IT managers. Thus multiple distributed security deployments can be individually controlled from centralized location(s).

This patent also provides means to create a secure operating environment for the protocol stack processing that, even if the host system gets compromised either through a virus or malicious attack, allows the network security and integrity to be maintained. This patent significantly adds to the trusted computing environment needs of the next generation computing systems.

Tremendous growth in storage capacity and storage networks has made storage area management a major cost item for IT departments. Policy based storage management is required to contain management costs. The described programmable classification engine allows deployment of storage policies that can be enforced on packet, transaction, flow and command boundaries. This will significantly reduce storage area management costs.

The programmable IP processor architecture also offers enough headroom to allow customer specific applications to be deployed. These applications may belong to multiple categories, e.g. network management, storage firewall or other security capabilities, bandwidth management, quality of service, virtualization, performance monitoring, zoning, LUN masking and the like.

The adaptable processor hardware may be used to accelerate many of the applications or services listed above based on the available reprogrammable resources, deployed applications, services, policies or a combination thereof.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a layered SCSI architecture and interaction between respective layers located between initiator and target systems.

FIG. 2 illustrates the layered SCSI architecture with iSCSI and TCP/IP based transport between initiator and target systems.

FIG. 3 illustrates an OSI stack comparison of software based TCP/IP stack with hardware-oriented protocols like Fibre channel.

FIG. 4 illustrates an OSI stack with a hardware based TCP/IP implementation for providing performance parity with the other non-IP hardware oriented protocols.

FIG. 5 illustrates a host software stack illustrating operating system layers implementing networking and storage stacks.

FIG. 6 illustrates software TCP stack data transfers.

FIG. 7 illustrates remote direct memory access data transfers using TCP/IP offload from the host processor as described in this patent.

FIG. 8 illustrates host software SCSI storage stack layers for transporting block storage data over IP networks.

FIG. 9 illustrates certain iSCSI storage network layer stack details of an embodiment of the invention.

FIG. 10 illustrates TCP/IP network stack functional details of an embodiment of the invention.

FIG. 11 illustrates an iSCSI storage data flow through various elements of an embodiment of the invention.

FIG. 12 illustrates iSCSI storage data structures useful in the invention.

FIG. 13 illustrates a TCP/IP Transmission Control Block data structure for a session database entry useful in an embodiment of the invention.

FIG. 14 illustrates an iSCSI session database structure useful in an embodiment of the invention.

FIG. 15 illustrates an iSCSI session memory structure useful in an embodiment of the invention.

FIG. 16 illustrates a high-level architectural block diagram of an IP network application processor useful in an embodiment of the invention.

FIG. 17 illustrates a detailed view of the architectural block diagram of the IP network application processor of FIG. 16.

FIG. 18 illustrates an input queue and controller for one embodiment of the IP processor.

FIG. 19 illustrates a packet scheduler, sequencer and load balancer useful in one embodiment of the IP processor.

FIG. 20 illustrates a packet classification engine, including a policy engine block of one embodiment of the IP storage processor.

FIG. 21 broadly illustrates an embodiment of the SAN packet processor block of one embodiment of an IP processor at a high-level.

FIG. 22 illustrates an embodiment of the SAN packet processor block of the described IP processor in further detail.

FIG. 23 illustrates an embodiment of the programmable TCP/IP processor engine which can be used as part of the described SAN packet processor.

FIG. 24 illustrates an embodiment of the programmable IP Storage processor engine which can be used as part of the described SAN packet processor.

FIG. 25 illustrates an embodiment of an output queue block of the programmable IP processor of FIG. 17.

FIG. 26 illustrates an embodiment of the storage flow controller and RDMA controller.

FIG. 27 illustrates an embodiment of the host interface controller block of the IP processor useful in an embodiment of the invention.

FIG. 28 illustrates an embodiment of the security engine.

FIG. 29 illustrates an embodiment of a memory and controller useful in the described processor.

FIG. 30 illustrates a data structure useable in an embodiment of the described classification engine.

FIG. 31 illustrates a storage read flow between initiator and target.

FIG. 32 illustrates a read data packet flow through pipeline stages of the described processor.

FIG. 33 illustrates a storage write operation flow between initiator and target.

FIG. 34 illustrates a write data packet flow through pipeline stages of the described processor.

FIG. 35 illustrates a storage read flow between initiator and target using the remote DMA (RDMA) capability between initiator and target.

FIG. 36 illustrates a read data packet flow between initiator and target using RDMA through pipeline stages of the described processor.

FIG. 37 illustrates a storage write flow between initiator and target using RDMA capability.

FIG. 38 illustrates a write data packet flow using RDMA through pipeline stages of the described processor.

FIG. 39 illustrates an initiator command flow in more detail through pipeline stages of the described processor.

FIG. 40 illustrates a read packet data flow through pipeline stages of the described processor in more detail.

FIG. 41 illustrates a write data flow through pipeline stages of the described processor in more detail.

FIG. 42 illustrates a read data packet flow when the packet is in cipher text or is otherwise a secure packet through pipeline stages of the described processor.

FIG. 43 illustrates a write data packet flow when the packet is in cipher text or is otherwise a secure packet through pipeline stages of the described processor of one embodiment of the invention.

FIG. 44 illustrates a RDMA buffer advertisement flow through pipeline stages of the described processor.

FIG. 45 illustrates a RDMA write flow through pipeline stages of the described processor in more detail.

FIG. 46 illustrates a RDMA Read data flow through pipeline stages of the described processor in more detail.

FIG. 47 illustrates steps of a session creation flow through pipeline stages of the described processor.

FIG. 48 illustrates steps of a session tear down flow through pipeline stages of the described processor.

FIG. 49 illustrates session creation and session teardown steps from a target perspective through pipeline stages of the described processor.

FIG. 50 illustrates an R2T command flow in a target subsystem through pipeline stages of the described processor.

FIG. 51 illustrates a write data flow in a target subsystem through pipeline stages of the described processor.

FIG. 52 illustrates a target read data flow through the pipeline stages of the described processor.

FIG. 53 illustrates a typical enterprise network with perimeter security.

FIG. 54 illustrates an enterprise network with distributed security using various elements of this patent.

FIG. 55 illustrates an enterprise network with distributed security including security for a storage area network using various elements of this patent.

FIG. 56 illustrates a Central Manager/Policy Server & MonitoringStation.

FIG. 57 illustrates Central Manager flow of the disclosed security feature.

FIG. 58 illustrates rule distribution flow for the Central Manager.

FIG. 59 illustrates Control Plane Processor/Policy Driver Flow for the processor of this patent.

FIG. 60 illustrates a sample of packet filtering rules that may be deployed in distributed security systems.

FIG. 61 illustrates a TCP/IP processor version of the IP processor of FIGS. 16 and 17.

FIG. 62 illustrates an adaptable TCP/IP processor.

FIG. 63 illustrates an adaptable TCP/IP processor useable as an alternate to that of FIG. 62.

FIG. 64 illustrates a runtime adaptable processor.

FIG. 65 illustrates a compute cluster.

FIG. 66 illustrates a security solution for providing security in a network.

FIG. 67 illustrates a security solution compiler flow.

FIG. 68 illustrates a flow-through secure network card architecture.

FIG. 69 illustrates a network line card with look-aside security architecture.

FIG. 70 illustrates a security and content search accelerator adapter architecture.

FIG. 71 illustrates a security processor architecture.

FIG. 72 illustrates an alternate security processor architecture.

FIG. 73 illustrates a third security processor architecture.

FIG. 74 illustrates a content search and rule processing engine architecture.

FIG. 75 illustrates an example of Regular Expression Rules (prior art).

FIG. 76 a illustrates Thompson's NFA (prior art).

FIG. 76 b illustrates Berry-Sethi NFA (prior art).

FIG. 76 c illustrates DFA (prior art).

FIG. 77 a illustrates a left-biased NFA and state transition table (prior art).

FIG. 77 b illustrates a right-biased NFA and state transition table (prior art).

FIG. 78 a illustrates state transition controls.

FIG. 78 b illustrates configurable next state tables per state.

FIG. 79 a illustrates state transition logic (STL) for a state.

FIG. 79 b illustrates a state logic block.

FIG. 80 illustrates a NFA based search engine.

FIG. 81 illustrates application state memory example configuration.

FIG. 82 illustrates a DFA based search engine.

FIG. 83 illustrates example DFA operations.

FIG. 84 illustrates a runtime adaptable search processor.

FIG. 85 illustrates a runtime adaptable search processor array.

FIG. 86 illustrates an example search cluster.

FIG. 87 illustrates Intra search cluster rule groups and switching example.

FIG. 88 illustrates an example rules configuration and context switch.

FIG. 89 illustrates a computing device with content search accelerator.

FIG. 90 illustrates an example anti-spam performance bottleneck and solution.

FIG. 91 illustrates anti-spam with anti-virus performance bottleneck.

FIG. 92 illustrates application content search performance bottleneck and solution.

FIG. 93 illustrates an example content search API usage model.

FIG. 94 illustrates an example content search API with example functions.

FIG. 95 illustrates an example application flow (static setup) using search processor.

FIG. 96 illustrates FSA compiler flow.

FIG. 97 illustrates search compiler flow (full+incremental rule distribution).

FIG. 98 illustrates FSA Synthesis & Compiler Flow for FPGA.

DETAILED DESCRIPTION OF THE INVENTION

I provide a new high performance content search processor that can relieve the performance bottleneck of content search from the host processor. The search processor of this patent is used to perform a large number of regular expression searches in parallel using NFA and DFA based search engines. The search processor can perform high speed content search at line rates from 1 Gbps to 10 Gbps and higher, when the best of class server microprocessor can only perform the same tasks at well below 100 Mbps. The content search processor can be used not only to perform layer 2 through layer 4 searches that may be used for classification and security applications, but also to perform deep packet inspection and layer 4 through layer 7 content analysis and security applications.
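
The idea of advancing many pattern automata over the same input stream can be sketched in software with the well-known Shift-And technique, which simulates the NFA of a literal pattern using one bit per state. This is a swapped-in textbook method shown only for intuition, not the search engine architecture of this patent; it assumes non-empty literal byte patterns rather than full regular expressions, and the names `build_masks` and `search_all` are hypothetical. The hardware engines would evaluate all rules concurrently, whereas the inner loop here visits them one after another.

```python
def build_masks(pattern: bytes) -> dict:
    """For each byte value, a bitmask of pattern positions where it occurs."""
    masks = {}
    for i, byte in enumerate(pattern):
        masks[byte] = masks.get(byte, 0) | (1 << i)
    return masks


def search_all(patterns: dict, data: bytes) -> list:
    """Step one bit-parallel NFA per pattern over data.

    Returns (rule_name, end_offset) for every match, in input order.
    Patterns are assumed to be non-empty literal byte strings.
    """
    engines = [(name, build_masks(p), 1 << (len(p) - 1))
               for name, p in patterns.items()]
    states = [0] * len(engines)
    hits = []
    for offset, byte in enumerate(data):
        for k, (name, masks, accept) in enumerate(engines):
            # Shift-And step: advance every active state and restart at state 0.
            states[k] = ((states[k] << 1) | 1) & masks.get(byte, 0)
            if states[k] & accept:
                hits.append((name, offset))
    return hits


if __name__ == "__main__":
    rules = {"cmd": b"cmd.exe", "virus": b"EICAR"}
    print(search_all(rules, b"GET /cmd.exe?x=EICAR"))
```

Each input byte is examined once per rule, so software throughput falls as the rule set grows; evaluating the rules in parallel hardware is what keeps the per-byte cost constant.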

I provide a new high performance and low latency way of implementing a TCP/IP stack in hardware to relieve the host processor of the severe performance impact of a software TCP/IP stack. This hardware TCP/IP stack is then interfaced with additional processing elements to enable high performance and low latency IP based storage applications.

This system also enables a new way of implementing security capabilities like firewall inside enterprise networks in a distributed manner using a hardware TCP/IP implementation with appropriate security capabilities in hardware having processing elements to enable high performance and low latency IP based network security applications. The hardware processors may be used inside network interface cards of servers, workstations, client PCs, notebook computers, handheld devices, switches, routers and other networked devices. The servers may be web servers, remote access servers, file servers, departmental servers, storage servers, network attached storage servers, database servers, blade servers, clustering servers, application servers, content/media servers, grid computers/servers, and the like. The hardware processors may also be used inside an I/O chipset of one of the end systems or network core systems like a switch or router or appliance or the like.

This system enables distributed security capabilities like firewall, intrusion detection, virus scan, virtual private network, confidentiality services and the like in internal systems of an enterprise network. The distributed security capabilities may be implemented using the hardware processor of this patent in each system, or in some of its critical systems, while others may deploy those services in software. Hence, the overall network will include distributed security as a hardware implementation or a software implementation or a combination thereof in different systems depending on the performance, cost and security needs as determined by IT managers. The distributed security systems will be managed from one or more centralized systems used by IT managers for managing the network using the principles described. This will enable an efficient and consistent deployment of security in the network using various elements of this patent.

This can be implemented in a variety of forms to provide benefits of TCP/IP termination, high performance and low latency IP storage capabilities, remote DMA (RDMA) capabilities, security capabilities, programmable classification, policy processing features, runtime adaptable processing, high speed content search and the like. Following are some of the embodiments that can implement this:

Server

The described architecture may be embodied in a high performance server environment providing hardware based TCP/IP functions or hardware TCP/IP and security or search functions or a combination of the foregoing that relieve the host server processor or processors of TCP/IP and/or security and/or search software and their performance overhead. Any processor of this invention may be a companion processor to a server chipset, providing the high performance networking interface with hardware TCP/IP and/or security and/or content search. Servers can be in various form factors like blade servers, appliance servers, file servers, thin servers, clustered servers, database servers, content search servers, game servers, grid computing servers, VoIP servers, wireless gateway servers, security servers, network attached storage servers or traditional servers. The current embodiment would allow creation of a high performance network interface on the server motherboard.

Further, the described runtime adaptable protocol processor or security processor or search processor or a combination of the foregoing architectures may also be used to provide additional capabilities or services besides protocol processing, like socket layer security, transport layer security, encryption/decryption, RDMA, RDMA security, application layer security, virtualization or higher application layer processing, which may further involve application level protocol processing (for example, protocol processing for HTTP, HTTPS, XML, SGML, Secure XML, other XML derivatives, Telnet, FTP, IP Storage, NFS, CIFS, DAFS, and the like). One embodiment could include TCP/IP protocol processing using the dedicated protocol processor and XML acceleration mapped to a runtime adaptable processor such as that disclosed in this patent. The protocol processor may or may not provide RDMA capabilities, depending upon the system needs and the supported line rates. Security processing capabilities of this invention may also be optionally incorporated in this embodiment. The same architecture could also be used to provide security acceleration support to XML data processing on the runtime adaptable processor of this patent.

Companion Processor to a Server Chipset

The server environment may also leverage the high performance IP storage processing capability of the described processor, besides high performance TCP/IP and/or RDMA capabilities. In such an embodiment the processor may be a companion processor to a server chipset providing high performance network storage I/O capability besides the TCP/IP offloading from the server processor. This embodiment would allow creation of high performance IP based network storage I/O on the motherboard. In other words, it would enable IP SAN on the motherboard.

Similar to the Server embodiment described above, this embodiment may also leverage the runtime adaptable processor of this patent to provide adaptable hardware acceleration along with protocol processing support in a server chipset. The runtime adaptable processor can be configured to provide storage services like virtualization, security services, multi-pathing, protocol translation and the like. Protocol translation may include, for example, translation from Fibre Channel protocol to IP Storage protocol or vice versa, Serial ATA protocol to IP Storage or Fibre Channel protocol or vice versa, Serial Attached SCSI protocol to IP Storage or Fibre Channel protocol or vice versa, and the like.

Further, the runtime adaptable search processor of this patent may provide content search acceleration capability to a server chipset similar to the server embodiment described above. The content search processor may be used for a vast variety of applications that use regular expressions for the search rules. The applications cover a spectrum of areas like bioinformatics, application and network layer security, proteomics, genetics, web search, protocol analysis, XML acceleration, data warehousing and the like.

Storage System Chipsets

The processor may also be used as a companion of a chipset in a storage system, which may be a storage array (or some other appropriate storage system or subsystem) controller, which performs the storage data server functionality in a storage networking environment. The processor would provide IP network storage capability to the storage array controller, allowing it to network in an IP based SAN. The configuration may be similar to that in a server environment, with additional capabilities in the system to access the storage arrays and provide other storage-centric functionality.

This embodiment may also leverage the runtime adaptable processor of this patent to provide adaptable hardware acceleration along with protocol processing support in a storage system chipset. The runtime adaptable processor can be configured to provide storage services like virtualization, security services, multi-pathing, protocol translation and the like. Protocol translation may include, for example, translation from Fibre Channel protocol to IP Storage protocol or vice versa, Serial ATA protocol to IP Storage or Fibre Channel protocol or vice versa, Serial Attached SCSI protocol to IP Storage or Fibre Channel protocol or vice versa, and the like. The runtime adaptable processor may also be used to provide acceleration for storage system metadata processing to improve the system performance.

This embodiment may also leverage the runtime adaptable search processor of this patent to provide high speed search of storage traffic for various applications like confidential information policy enforcement, virus/malware detection, high speed indexing for ease of searching, virtualization, content based switching, security services and the like.

Server/Storage Host Adapter Card

The IP processor may also be embedded in a server host adapter card providing high speed TCP/IP networking. The same adapter card may also be able to offer high speed network security capability for IP networks. Similarly, the adapter card may also be able to offer high speed network storage capability for IP based storage networks. The adapter card may be used in traditional servers and may also be used as blades in a blade server configuration. The processor may also be used in adapters in a storage array (or other storage system or subsystem) front end providing IP based storage networking capabilities. The adapter card may also leverage the runtime adaptable processors of this patent in a way similar to that described above.

Processor Chipset Component

The TCP/IP processor may be embodied inside a processor chipset, providing the TCP/IP offloading capability. Such a configuration may be used in high end servers, workstations or high performance personal computers that interface with high speed networks. Such an embodiment could also include the IP storage or RDMA capabilities of this invention, or a combination thereof, to provide IP based storage networking and/or TCP/IP with RDMA capability embedded in the chipset. The usage of multiple capabilities of the described architecture can be made independent of using other capabilities in this or other embodiments, as a trade-off of feature requirements, development timeline and cost, silicon die cost, and the like. The processor chipset may also incorporate the runtime adaptable processors of this patent to offer a variable set of functions on demand by configuring the processor for the desired application.

Storage or SAN System or Subsystem Switching Line Cards

The IP processor may also be used to create high performance, low latency IP SAN switching system (or other storage system or subsystem) line cards. The processor may be used as the main processor terminating and originating IP-based storage traffic to/from the line card. This processor would work with the switching system fabric controller, which may act like a host, to transport the terminated storage traffic, based on its IP destination, to the appropriate switch line card as determined by the forwarding information base present in the switch system. Such a switching system may support purely IP based networking or may provide multi-protocol support, allowing interfacing with an IP based SAN along with other data center SAN fabrics like Fibre Channel. A very similar configuration could exist inside a gateway controller system that terminates IP storage traffic from a LAN or WAN and originates new sessions to carry the storage traffic into a SAN, which may be an IP based SAN or, more likely, a SAN built from other fabrics inside a data center like Fibre Channel. The processor could also be embodied in a SAN gateway controller. These systems would use security capabilities of this processor to create a distributed security network within enterprise storage area networks as well.
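The forwarding step in this embodiment, where terminated storage traffic is sent to a line card chosen from the forwarding information base according to its IP destination, can be sketched as a longest-prefix-match lookup. This is only an illustration under stated assumptions: the `FIB` contents, the line card names and the `lookup_line_card` helper are hypothetical and are not taken from this patent.

```python
import ipaddress

# Hypothetical forwarding information base: IP prefix -> switch line card.
FIB = {
    ipaddress.ip_network("10.0.0.0/8"): "line-card-1",
    ipaddress.ip_network("10.1.0.0/16"): "line-card-2",
    ipaddress.ip_network("0.0.0.0/0"): "line-card-0",  # default route
}


def lookup_line_card(dst_ip: str) -> str:
    """Return the line card for the longest prefix that contains dst_ip."""
    addr = ipaddress.ip_address(dst_ip)
    matches = [net for net in FIB if addr in net]
    # The most specific (longest) matching prefix wins.
    best = max(matches, key=lambda net: net.prefixlen)
    return FIB[best]
```

A hardware switch would typically hold such a table in a dedicated lookup structure; the linear scan here only shows the selection rule.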

The runtime adaptable processor of this patent can be very effective in providing hardware acceleration capabilities, individually or in combination as described above, like protocol translation, virtualization, security, bandwidth management, rate limiting, grooming, policy based management and the like.

The runtime adaptable search processor of this patent can be used to provide deep content inspection capability in this embodiment, which can be used to create intelligent networking capabilities like content based switching, application oriented networking and the like.

Network Switches, Routers, Wireless Access Points

The processor may also be embedded in a network interface line card providing high speed TCP/IP networking for switches, routers, gateways, wireless access points and the like. The same adapter card may also be able to offer high speed network security capability for IP networks. This processor would provide the security capabilities that can then be used in a distributed security network.

The runtime adaptable processor of this patent may also be used in such embodiments offering services and capabilities described above as well as others like Wired Equivalent Privacy security capabilities, RADIUS and like security features as needed by the environment. The runtime adaptable processor may also be used to provide dynamically changeable protocol processing capability besides TCP/IP processing to support wireless protocols like Bluetooth, HomeRF, wireless Ethernet LAN protocols at various line rates, 3GPP, GPRS, GSM, or other wireless LAN or RF or cellular technology protocols, or any combinations thereof.

The runtime adaptable search processor of this patent can be used to provide deep content inspection capability in this embodiment, which can be used to create intelligent networking capabilities like content based switching, application oriented networking and the like.

Storage Appliance

Storage network management costs are increasing rapidly. The ability to manage the significant growth in the networks and the storage capacity would require creating special appliances which would provide the storage area management functionality. The described management appliances for a high performance IP based SAN would implement the described high performance IP processor to be able to perform their functions on the storage traffic transported inside TCP/IP packets. These systems would require a high performance processor to do deep packet inspection and extract the storage payload in the IP traffic to provide policy based management and enforcement functions. The security, programmable classification and policy engines, along with the high speed TCP/IP and IP storage engines described, would enable these appliances and other embodiments described in this patent to perform deep packet inspection and classification and apply the policies that are necessary on a packet by packet basis at high line rates and at low latency. Further, these capabilities can enable creating storage management appliances that can perform their functions like virtualization, policy based management, security enforcement, access control, intrusion detection, bandwidth management, traffic shaping, quality of service, anti-spam, virus detection, encryption, decryption, LUN masking, zoning, link aggregation and the like in-band to the storage area network traffic. Similar policy based management and security operations or functionality may also be supported inside the other embodiments described in this patent. The runtime adaptable processor of this patent can be used to dynamically support or accelerate one or more of the applications/services. The services/applications supported may be selected by the policies in existence under the influence or control of the user or the administrator. The runtime adaptable search processor of this invention can be used to perform high speed network line rate searches in this embodiment, which can be used for several deep packet inspection, content search and security applications discussed above.

Clustered Environments

Server systems are used in a clustered environment to increase the system performance and scalability for applications like clustered databases and the like. The applications running on high performance cluster servers require the ability to share data at high speeds for inter-process communication. Transporting this inter-process communication traffic on a traditional software TCP/IP network between cluster processors suffers from severe performance overhead. Hence, specialized fabrics like Fibre Channel have been used in such configurations. However, a TCP/IP based fabric which can allow direct memory access between the communicating processes' memory can be used by applications that operate on any TCP/IP network without being changed to specialized fabrics like Fibre Channel. The described IP processor, with its high performance TCP/IP processing capability and the RDMA features, can be embodied in a cluster server environment to provide the benefits of high performance and low latency direct memory to memory data transfers. This embodiment may also be used to create global clustering and can also be used to enable data transfers in grid computers and grid networks. The processor of this patent may also be used to accelerate local cluster data transfers using lightweight protocols other than TCP/IP to avoid the latency and protocol processing overhead. The runtime adaptable processor architecture can be leveraged to support such a lightweight protocol. Thus the same processor architecture may be used for local as well as global clustering and enable data transfers in grid computers and grid networks. The programmable processor of this patent may also be used for similar purposes without burdening the runtime adaptable processor. The processor architecture of this patent can thus be used to enable utility computing. The runtime adaptable processors of this patent may also be used to provide the capabilities described in other embodiments above to the clustered environment as well.

XML Accelerator

The runtime adaptable TCP/IP processor of this patent can also be used as a component inside a system or adapter card or as part of a chipset providing TCP/IP protocol termination or XML acceleration or a combination thereof. As web services usage increases, more and more web documents may start using XML or XML derivatives. The burden of processing XML on each web page access can be very significant on the host processors, requiring additional hardware support. The runtime adaptable processor of this patent can be used in such an environment to provide acceleration to XML processing, whereas transport protocol processing is handled by the dedicated protocol processor of this patent. XML documents may also need security support, in which case the processor can be dynamically configured to provide security acceleration for secure XML documents. Similarly, the runtime adaptable search processor of this invention can also be used in such an embodiment, and provide high speed XML processing capability to meet 1 Gbps, 10 Gbps and higher line rates. The search processor of this invention allows multiple thousand regular expression rules to be simultaneously evaluated for this application and can thus be used in networking systems also to provide intelligent networking and application oriented networking capabilities.
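Evaluating a set of regular expression rules against one payload can be sketched in software as follows. This is a minimal sketch, not the search processor's method: the rule names and patterns in `RULES` and the `matching_rules` helper are hypothetical, and software checks the rules one after another where the described hardware would evaluate them simultaneously.

```python
import re

# Hypothetical rule set: rule id -> regular expression.
RULES = {
    "xml_decl": r"<\?xml\s+version=",
    "soap_envelope": r"<(\w+:)?Envelope\b",
    "script_tag": r"<script\b",
}

# Compile once so that each payload only pays for matching.
COMPILED = {rule_id: re.compile(pattern, re.IGNORECASE)
            for rule_id, pattern in RULES.items()}


def matching_rules(payload: str) -> set:
    """Return the ids of all rules that match somewhere in the payload."""
    return {rule_id for rule_id, regex in COMPILED.items()
            if regex.search(payload)}
```

The cost of this loop grows with the number of rules, which is the overhead a parallel hardware search engine is meant to remove.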

Voice Over IP (VoIP) Appliances

The processor of this patent can also be embedded inside voice over IP appliances like VoIP phones, servers, gateways, handheld devices, and the like. The protocol processor can be used to provide IP protocol processing, as well as the transport layer protocol processing as needed in the VoIP environment. Further, the runtime adaptable processor may be dynamically adapted to provide signal processing and DSP hardware acceleration capabilities that may be required for the VoIP appliance and the applications running on the appliance. The runtime adaptable search processor of this application can be used for speech recognition applications besides the VoIP applications, similar to the discussion above.

Handheld Devices

The processor of this patent may also be used to provide protocol processing hardware capability to processors or chipsets of handheld devices, phones, personal digital assistants and the like. The protocol processor along with the runtime adaptable processor may provide many of the capabilities described above for many of the embodiments. The processor of this patent may be used to create a secure protocol processing stack inside these devices as well as provide other services using hardware acceleration. The runtime adaptable processor may be used to enable the handheld devices to network in a wired or wireless manner. The device can then be dynamically adapted to work with a multitude of protocols like Bluetooth, Wireless Ethernet LAN, RF, GPRS, GSM, CDMA, CDMA variants or other 3G cellular technology or other wireless or cellular or RF technologies by using the protocol processor and the runtime adaptable processor of this patent. The runtime adaptable search processor of this patent may also be embodied in handheld devices to accelerate the search performance of the already low performance processors. The search processor of this application may be incorporated in a chipset or a processor of the handheld device to accelerate the search performance of the applications running on these platforms. The search processor may also be used as a multi-purpose security processor offering cryptographic security and network layer security functions through application layer security. It may be embodied in a chipset or the host processor or as a companion processor in the handheld devices to protect against the emerging security threats for a variety of handheld devices.

Additional Embodiments

The processor architecture can be partially implemented in software and partially in hardware. The performance needs and cost implications can drive trade-offs for hardware and software partitioning of the overall system architecture of this invention. It is also possible to implement this architecture as a combination of chip sets along with the hardware and software partitioning or independent of the partitioning. For example, the security processor and the classification engines could be on separate chips and provide similar functions. This can result in lower silicon cost of the IP processor including the development and manufacturing cost, but it may in some instances increase the part count in the system and may increase the footprint and the total solution cost. Security and classification engines could be separate chips as well. As used herein, a chip set may mean a multiple-chip chip set, or a chip set that includes only a single chip, depending on the application.

The storage flow controller and the queues could be maintained in software on the host or may become part of another chip in the chipset. Hence, multiple ways of partitioning this architecture are feasible to accomplish the high performance IP based storage and TCP/IP offload applications that will be required with the coming high performance processors in the future. The storage engine description has been given with respect to iSCSI; however, with TCP/IP and storage engine programmability, classifier programmability and the storage flow controller along with the control processor, other IP storage protocols like iFCP, FCIP and others can be implemented with the appropriate firmware. iSCSI operations may also represent IP Storage operations. The high performance IP processor core may be coupled with multiple input/output ports of lower line rates, matching the total throughput, to create a multi-port IP processor embodiment as well.

It is feasible to use this architecture for high performance TCP/IP offloading from the main processor without using the storage engines. This can result in a silicon and system solution for next generation high performance networks for the data and telecom applications. The TCP/IP engine can be augmented with application specific packet accelerators and leverage the core architecture to derive new flavors of this processor. It is possible to replace the storage engine with another application specific accelerator like a firewall engine or a route look-up engine or a telecom/network acceleration engine, along with the other capabilities of this invention, and target this processor architecture for telecom/networking and other applications.

The runtime adaptable search processor of this invention may also be embodied inside a variety of application specific systems targeted towards security or content search applications or the like. For example, the search processor may be embodied inside an email security appliance that provides capabilities like anti-spam, anti-virus, anti-spyware, anti-worms, other malware prevention, as well as confidential information security and other regulatory compliance for regulations like Sarbanes-Oxley, Gramm-Leach-Bliley, HIPAA and the like. The search processor may also be embodied in a system that may be used for extrusion detection and prevention systems to protect against confidential information leaks and other regulatory compliance like those listed above at an enterprise edge.

Detailed Description

The ability to perform content search has become a critical capability in the networked world. As the network line rates go up to 1 Gbps, 10 Gbps and higher, it is important to be able to perform deep packet inspection for many applications at line rate. Several security issues, like viruses, worms, confidential information leaks and the like, can be detected and prevented from causing damage if the network traffic can be inspected at high line rates. In general, content search rules can be represented using regular expressions. Regular expression rules can be represented and computed using FSAs. NFAs and DFAs are the two types of FSAs that are used for evaluation of regular expressions. For high line rate applications a composite DFA can be used, where each character of the input stream can be processed per cycle of memory access. However, the search speed is then limited by the memory access speed. Another limitation of such an approach is the amount of memory required to search even a modest number of regular expression rules. As discussed above, NFAs also have their limitations in achieving high performance on general purpose processors. In general, today's best of class microprocessors can only achieve less than 100 Mbps performance using NFAs or DFAs for a small number of regular expressions. Hence, there is a clear need to create targeted content search processor hardware to raise the performance of the search to the line rates of 1 Gbps and 10 Gbps. This invention shows such a high performance content search processor that can be targeted for high line rates.
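The table-driven DFA evaluation mentioned above, in which each input character costs one transition-table lookup, can be sketched as follows. This is an illustrative sketch under stated assumptions: the rule `ab+c`, the hand-built `TRANSITIONS` table and the `dfa_search` helper are hypothetical examples and do not come from this patent.

```python
# Hypothetical DFA for the rule "ab+c", matched anywhere in the stream.
# State 0: nothing matched, 1: seen "a", 2: seen "ab+", 3: accept.
TRANSITIONS = {
    0: {"a": 1},
    1: {"a": 1, "b": 2},
    2: {"a": 1, "b": 2, "c": 3},
}
ACCEPT = 3


def dfa_search(stream: str) -> int:
    """Return the index just past the first match, or -1 if there is none."""
    state = 0
    for i, ch in enumerate(stream):
        # One table lookup per input character; unlisted characters
        # fall back to the start state.
        state = TRANSITIONS[state].get(ch, 0)
        if state == ACCEPT:
            return i + 1
    return -1
```

Because every character triggers exactly one lookup, throughput is bounded by how fast the table can be read, which corresponds to the memory access limit noted in the text. A composite DFA for many rules uses the same loop with a much larger table, which is where the memory cost arises.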

Storage costs and demand have been increasing at a rapid pace over the last several years. They are expected to grow at the same rate in the foreseeable future. With the advent of e-business, availability of the data at any time and anywhere, irrespective of the server or system downtime, is critical. This is driving a strong need to move the server attached storage onto a network to provide storage consolidation, availability of data and ease of management of the data. The storage area networks (SANs) are today predominantly based on Fibre Channel technology, which provides various benefits like low latency and high performance with its hardware oriented stacks compared to TCP/IP technology.

Some systems transport block storage traffic over IP, which was designed to transport data streams. The data streams are transported using Transmission Control Protocol (TCP), which is layered to run on top of IP. TCP/IP is a reliable connection oriented protocol implemented in software within the operating systems. A TCP/IP software stack is too slow to handle the high line rates that will be deployed in the future. New hardware solutions will accelerate the TCP/IP stack to carry storage and network traffic and be competitive with FC based solutions.

The prevalent storage protocol in high performance servers, workstations and storage controllers and arrays is the SCSI protocol, which has been around for 20 years. SCSI architecture is built as a layered protocol architecture. FIG. 1 illustrates the various SCSI architecture layers within an initiator, block 101, and target subsystems, block 102. As used in this patent, the terms “initiator” and “target” mean a data processing apparatus, or a subsystem or system including them. The terms “initiator” and “target” can also mean a client or a server or a peer. Likewise, the term “peer” can mean a peer data processing apparatus, or a subsystem or system thereof. A “remote peer” can be a peer located across the world or across the room.

The initiator and target subsystems in FIG. 1 interact with each other using the SCSI application protocol layer, block 103, which is used to provide client-server request and response transactions. It also provides device service request and response between the initiator and the target mass storage device, which may take many forms like disk arrays, tape drives, and the like. Traditionally, the target and initiator are interconnected using the SCSI bus architecture carrying the SCSI protocol, block 104. The SCSI protocol layer is the transport layer that allows the client and the server to interact with each other using the SCSI application protocol. The transport layer must present the same semantics to the upper layer so that the upper layer protocols and application can stay transport protocol independent.

FIG. 2 illustrates the SCSI application layer on top of IP based transport layers. An IETF standards track protocol, iSCSI (SCSI over IP) is an attempt to provide IP based storage transport protocol. There are other similar attempts including FCIP (FC encapsulated in IP), iFCP (FC over IP) and others. Many of these protocols layer on top of TCP/IP as the transport mechanism, in a manner similar to that illustrated in FIG. 2. As illustrated in FIG. 2, the iSCSI protocol services layer, block 204, provides the layered interface to the SCSI application layer, block 203. iSCSI carries SCSI commands and data as iSCSI protocol data units (PDUs) as defined by the standard. These protocol data units then can be transported over the network using TCP/IP, block 205, or the like. The standard does not specify the means of implementing the underlying transport that carries iSCSI PDUs. FIG. 2 illustrates iSCSI layered on TCP/IP which provides the transport for the iSCSI PDUs.

An IP based storage protocol like iSCSI can be layered in software on top of a software based TCP/IP stack. However, such an implementation would suffer serious performance penalties arising from software TCP/IP and the storage protocol layered on top of that. Such an implementation would severely impact the performance of the host processor and may make the processor unusable for any other tasks at line rates above 1 Gbps. Hence, we would implement the TCP/IP stack in hardware, relieving the host processor, on which the storage protocol can be built. The storage protocol, like iSCSI, can be built in software running on the host processor or may, as described in this patent, be accelerated using a hardware implementation. A software iSCSI stack will present many interrupts to the host processor to extract PDUs from received TCP segments to be able to act on them. Such an implementation will suffer severe performance penalties for reasons similar to those for which a software based TCP stack would. The described processor provides a high performance and low latency architecture to transport a storage protocol on a TCP/IP based network that eliminates or greatly reduces the performance penalty on the host processor, and the resulting latency impact.

FIG. 3 illustrates a comparison of the TCP/IP stack to Fibre Channel as referenced to the OSI networking stack. The TCP/IP stack, block 303, as discussed earlier in the Summary of the Invention section of this patent, has performance problems resulting from the software implementation on the hosts. Compared to that, specialized networking protocols like Fibre Channel, block 304, and others are designed to be implemented in hardware. The hardware implementation allows these networking solutions to achieve higher performance than the IP based solution. However, the ubiquitous nature of IP and the familiarity of IP from the IT users' and developers' perspective make IP more suitable for widespread deployment. This can be accomplished if the performance penalties resulting from TCP/IP are reduced to be equivalent to those of the other competing specialized protocols. FIG. 4 illustrates a protocol level layering in hardware and software that is used for TCP/IP, block 403, to become competitive with the other illustrated specialized protocols.

FIG. 5 illustrates a host operating system stack using a hardware based TCP/IP and storage protocol implementation of this patent. The protocol is implemented such that it can be introduced into the host operating system stack, block 513, with the operating system layers above it unchanged. This allows the SCSI application protocols to operate without any change. The driver layer, block 515, and the stack underneath for the IP based storage interface, block 501, will present an interface similar to that of a non-networked SCSI interface, blocks 506 and 503, or a Fibre Channel interface, block 502.

FIG. 6 illustrates the data transfers involved in a software TCP/IP stack. Such an implementation of the TCP/IP stack carries huge performance penalties from memory copies during the data transfers. The figure illustrates data transfer between client and server networking stacks. User level application buffers, block 601, that need to be transported from the client to the server or vice versa, go through the various levels of data transfers shown. The user application buffers on the source get copied into the OS kernel space buffers, block 602. This data then gets copied to the network driver buffers, block 603, from where it gets DMA-transferred to the network interface card (NIC) or the host bus adapter (HBA) buffers, block 604. The buffer copy operations involve the host processor and use up valuable processor cycles. Further, the data being transferred goes through checksum calculations on the host, using up additional computing cycles from the host. The data movement into and out of the system memory on the host multiple times creates a memory bandwidth bottleneck as well. The data transferred to the NIC/HBA is then sent on to the network, block 609, and reaches the destination system. At the destination system the data packet traverses the software networking stack in the opposite direction from that on the source host, though following similar buffer copies and checksum operations. Such an implementation of the TCP/IP stack is very inefficient for block storage data transfers and for clustering applications where a large amount of data may be transferred between the source and the destination.
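The host-side cost described for FIG. 6 can be modeled with a short sketch: two processor-driven buffer copies plus a checksum pass before the data reaches the NIC/HBA. The `software_send` function and its copy counter are hypothetical modeling aids, and placing the checksum after the driver copy is an assumption made only for illustration. The checksum routine is the standard 16-bit one's-complement Internet checksum.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum as used by IP and TCP."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length data with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold the carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def software_send(app_buffer: bytes):
    """Model the send path of FIG. 6; returns (nic_buffer, checksum, cpu_copies)."""
    cpu_copies = 0
    kernel_buffer = bytes(bytearray(app_buffer))     # block 601 -> block 602
    cpu_copies += 1
    driver_buffer = bytes(bytearray(kernel_buffer))  # block 602 -> block 603
    cpu_copies += 1
    checksum = internet_checksum(driver_buffer)      # extra pass over the data
    nic_buffer = bytes(bytearray(driver_buffer))     # DMA to block 604, not a CPU copy
    return nic_buffer, checksum, cpu_copies
```

Each step touches every byte of the payload, so the host reads or writes the same data several times before it leaves the system, which is the memory bandwidth bottleneck the text describes.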

FIG. 7 illustrates the networking stack in an initiator and in a targetwith features that allow remote direct memory access (RDMA) features ofthe architecture described in this patent. The following can be calledan RDMA capability or an RDMA mechanism or an RDMA function. In such asystem the application running on the initiator or target registers aregion of memory, block 702, which is made available to its peer(s) foraccess directly from the NIC/HBA without substantial host intervention.These applications would also let their peer(s) know about the memoryregions being available for RDMA, block 708. Once both peers of thecommunication are ready to use the RDMA mechanism, the data transferfrom RDMA regions can happen with essentially zero copy overhead fromthe source to the destination without substantial host intervention ifNIC/HBA hardware in the peers implement RDMA capability. The source, orinitiator, would inform its peer of its desire to read or write specificRDMA enabled buffers and then let the destination or target, push orpull the data to/from its RDMA buffers. The initiator and the targetNIC/HBA would then transport the data using the TCP/IP hardwareimplementation described in this patent, RMDA 703, TCP/IP offload 704,RMDA 708 and TCP/IP offload 709, between each other without substantialintervention of the host processors, thereby significantly reducing theprocessor overhead. This mechanism would significantly reduce the TCP/IPprocessing overhead on the host processor and eliminate the need formultiple buffer copies for the data transfer illustrated in FIG. 6. RDMAenabled systems would thus allow the system, whether fast or slow, toperform the data transfer without creating a performance bottleneck forits peer. RDMA capability implemented in this processor in storage overIP solution eliminates host intervention except usually at the datatransfer start and termination. 
This frees the host processors in both target and initiator systems to perform useful tasks without being interrupted at each packet arrival or transfer. RDMA implementation also allows the system to be secure and prevent unauthorized access. This is accomplished by registering the exported memory regions with the HBA/NIC with their access control keys along with the region IDs. The HBA/NIC performs the address translation of the memory region request from the remote host to the RDMA buffer, performs security operations such as security key verification and then allows the data transfer. This processing is performed off the host processor in the processor of this invention residing on the HBA/NIC or as a companion processor to the host processor on the motherboard, for example. This capability can also be used for large data transfers for server clustering applications as well as client server applications. Real time media applications transferring large amounts of data between a source or initiator and a destination or target can benefit from this.
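The registration, key verification and address translation steps described above can be sketched as follows. This is a minimal illustration only, not the disclosed hardware: the names `Region`, `RdmaRegionTable`, `register` and `translate`, and the flat base-plus-offset translation, are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Region:
    base: int        # start address of the registered buffer (hypothetical)
    length: int      # size of the registered buffer in bytes
    access_key: int  # key the remote peer must present


class RdmaRegionTable:
    """Illustrative table of exported memory regions, keyed by region ID."""

    def __init__(self):
        self._regions = {}

    def register(self, region_id, base, length, access_key):
        # The application registers a region together with its access key.
        self._regions[region_id] = Region(base, length, access_key)

    def translate(self, region_id, key, offset, size):
        """Return the local address for a remote request, or raise if not allowed."""
        region = self._regions.get(region_id)
        if region is None:
            raise PermissionError("unknown region")
        if key != region.access_key:
            raise PermissionError("access key mismatch")
        if offset < 0 or size < 0 or offset + size > region.length:
            raise PermissionError("request outside registered region")
        return region.base + offset
```

In this sketch a request that presents the wrong key or reaches past the end of the registered region is rejected before any address is produced, which mirrors the order of operations in the text: verify first, then allow the transfer.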

FIG. 8 illustrates the host file system and SCSI stack implemented in software. As indicated earlier, the IP based storage stack, blocks 805, 806, 807, 808 and 809, should present an interface to the SCSI layers, blocks 803 and 804, consistent with that provided by the SCSI transport layer, block 811, or Fibre Channel transport, block 810. This figure illustrates high level requirements that are imposed on the IP based storage implementation from a system level, besides those imposed by various issues of IP, which was not designed to transport performance sensitive block data.

FIG. 9 illustrates the iSCSI stack in more detail than that illustrated in FIG. 8. The iSCSI stack, blocks 805 through 809, should provide an OS defined driver interface level functionality to the SCSI command consolidation layer, blocks 803 and 804, such that the behavior of this layer and other layers on top of it is unchanged. FIG. 9 illustrates a set of functions that would be implemented to provide IP storage capabilities. The functions that provide the iSCSI functionality are grouped into related sets of functions, although there can be many variations of these, as any person skilled in this area would appreciate. There is a set of functions required to meet the standard, e.g. target and initiator login and logout functions, block 916, and connection establishment and teardown functions, block 905. The figure illustrates functions that allow the OS SCSI software stack to discover the iSCSI device, block 916, set and get options/parameters, blocks 903 and 909, to start the device, block 913, and release the device, block 911. Besides the control functions discussed earlier, the iSCSI implementation provides bulk data transfer functions, through queues 912 and 917, to transport the PDUs specified by the iSCSI standard. The iSCSI stack may also include direct data transfer/placement (DDT) or RDMA functions or a combination thereof, block 918, which are used by the initiator and target systems to perform substantially zero buffer copy and host intervention-less data transfers, including storage and other bulk block data transfers. The SCSI commands and the block data transfers related to these are implemented as command queues, blocks 912 and 917, which get executed on the described processor. The host is interrupted primarily on the command completion. The completed commands are queued for the host to act on at a time convenient to the host.
The figure illustrates the iSCSI protocol layer and the driver layer layered on the TCP/IP stack, blocks 907 and 908, which is also implemented off the host processor on the IP processor system described herein.

FIG. 10 illustrates the TCP/IP stack functionality that is implemented in the described IP processor system. These functions provide an interface to the upper layer protocol functions to carry the IP storage traffic as well as other applications that can benefit from direct OS TCP/IP bypass, RDMA or network sockets direct capabilities or a combination thereof to utilize the high performance TCP/IP implementation of this processor. The TCP/IP stack provides capabilities to send and receive upper layer data, blocks 1017 and 1031, and command PDUs, transport connection establishment and teardown functions, block 1021, send and receive data transfer functions, checksum functions, block 1019, as well as error handling functions, block 1022, and segmenting and sequencing and windowing operations, block 1023. Certain functions like checksum verification/creation touch every byte of the data transfer, whereas some functions that transport the data packets and update the transmission control block or session data base are invoked for each packet of the data transfer. The session DB, block 1025, is used to maintain various information regarding the active sessions/connections along with the TCP/IP state information. The TCP layer is built on top of the IP layer that provides the IP functionality as required by the standard. This layer provides functions to fragment/de-fragment, block 1033, the packets as per the path MTU, provide the route and forwarding information, block 1032, as well as an interface to other functions necessary for communicating errors like, for example, ICMP, block 1029. The IP layer interfaces with the Ethernet layer or other media access layer technology to transport the TCP/IP packets onto the network. The lower layer is illustrated as Ethernet in various figures in this description, but could be other technologies like SONET, for instance, to transport the packets over SONET on MANs/WANs.
Ethernet may also be used in similar applications, but may be used more so within a LAN and dedicated local SAN environments, for example.
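To make concrete why checksum functions touch every byte of a transfer, the following is a minimal software sketch of the 16-bit one's-complement Internet checksum used by TCP and IP (in the style of RFC 1071). It is an illustration only and says nothing about how block 1019 is built in hardware; the function name `internet_checksum` is chosen for the example.

```python
def internet_checksum(data: bytes) -> int:
    """Return the complemented 16-bit one's-complement sum of the data."""
    if len(data) % 2:
        data += b"\x00"              # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        # Every byte is read once and added as part of a big-endian 16-bit word.
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:               # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF
```

A receiver can verify a block by running the same function over the data followed by the transmitted checksum; a result of zero indicates that the sum checks out.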

FIG. 11 illustrates the iSCSI data flow. The figure illustrates the receive and transmit path of the data flow. The Host's SCSI command layer working with the iSCSI driver, both depicted in block 1101, would schedule the commands to be processed to the command scheduler, block 1108, in the storage flow controller seen in more detail in FIG. 26. The command scheduler 1108 schedules the new commands for operation in the processor described in more detail in FIG. 17. A new command that is meant for the target device with an existing connection gets en-queued to that existing connection, block 1111. When the connection to the target device does not exist, a new command is en-queued on to the unassigned command queue, block 1102. The session/connection establishment process like that shown in FIG. 47 and blocks 905 and 1006 is then called to connect to the target. Once the connection is established, the corresponding command from the queue 1102 gets en-queued to the newly created connection command queue 1111 by the command scheduler 1108 as illustrated in the figure. Once a command reaches a stage of execution, the receive 1107 or transmit 1109 path is activated depending on whether the command is a read or a write transaction. The state of the connection/session on which the command is transported is used to record the progress of the command execution in the session database as described subsequently. The buffers associated with the data transfer may be locked till such time as the transfer is completed. If the RDMA mechanism is used to transfer the data between the initiator and the target, appropriate region buffer identifiers, access control keys and related RDMA state data are maintained in memory on board the processor and may also be maintained in off-chip memory depending on the implementation chosen.
As the data transfer, which may be over multiple TCP segments, associated with the command is completed, the status of the command execution is passed on to the host SCSI layer, which then does the appropriate processing. This may involve releasing the buffers being used for data transfers to the applications, statistics update, and the like. During transfer, the iSCSI PDUs are transmitted by the transmit engines, block 1109, working with the transmit command engines, block 1110, that interpret the PDU and perform appropriate operations like retrieving the application buffers from the host memory using DMA to the storage processor and keeping the storage command flow information in the iSCSI connection database updated with the progress. As used in this patent the term “engine” can be a data processor or a part of a data processor, appropriate for the function or use of the engine. Similarly, the receive engines, block 1107, interpret the received command into new requests, responses, errors or other command or data PDUs that need to be acted on appropriately. These receive engines working with the command engines, block 1106, route the read data or received data to the appropriate allocated application buffer through direct data transfer/placement or RDMA control information maintained for the session in the iSCSI session table. On command completion the control of the respective buffers, blocks 1103 and 1112, is released for the application to use. Receive and transmit engines can be the SAN packet processors 1706(a) to 1706(n) of FIG. 17 of this IP processor working with the session information recorded in the session data base entries 1704, which can be viewed as a global memory as viewed from the TCP/IP processor of FIG. 23 or the IP processor of FIG.
24. The same engines can get reused for different packets and commands with the appropriate storage flow context provided by the session database discussed in more detail below with respect to block 1704 and the portion of the session database in 1708 of FIG. 17. For clarification, the terms IP network application processor, IP Storage processor, IP Storage network application processor and IP processor can be the same entity, depending on the application. An IP network application processor core or an IP storage network application processor core can be the same entity, depending on the application.
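The command queueing described for FIG. 11 (an unassigned queue for commands whose target has no connection yet, and a per-connection queue once the connection exists) can be sketched in software as follows. The class and method names are hypothetical and the sketch ignores priorities, errors and the receive/transmit paths.

```python
from collections import deque


class CommandScheduler:
    """Illustrative model of the unassigned and per-connection command queues."""

    def __init__(self):
        self.unassigned = deque()   # (target, command) pairs awaiting a connection
        self.connections = {}       # target -> command queue of its connection

    def submit(self, target, command):
        queue = self.connections.get(target)
        if queue is not None:
            queue.append(command)                      # existing connection
        else:
            self.unassigned.append((target, command))  # no connection yet

    def connection_established(self, target):
        """Create the connection queue and move waiting commands onto it."""
        queue = self.connections.setdefault(target, deque())
        still_waiting = deque()
        for tgt, command in self.unassigned:
            if tgt == target:
                queue.append(command)
            else:
                still_waiting.append((tgt, command))
        self.unassigned = still_waiting
```

Moving commands in arrival order keeps the ordering of commands that were submitted before the connection was available.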

Similarly, a control command can use the transmit path whereas the received response would use the receive path. Similar engines can exist on the initiator as well as the target. The data flow direction is different depending on whether it is the initiator or the target. However, essentially similar data flow exists on both initiator and target, with additional steps at the target. The target needs to perform additional operations to reserve the buffers needed to get the data of a write command, for instance, or may need to prepare the read data before the data is provided to the initiator. Similar instances would exist in case of an intermediate device, although in such a device, which may be a switch or an appliance, some level of virtualization or frame filtering or other such operation may be performed that may require termination of the session on one side and originating sessions on the other. This functionality is supported by this architecture but not illustrated explicitly in this figure, inasmuch as it is well within the knowledge of one of ordinary skill in the art.

FIG. 12 through FIG. 15 illustrate certain protocol information regarding transport sessions and how that information may be stored in a database in memory.

FIG. 12 illustrates the data structures that are maintained for iSCSI protocol and associated TCP/IP connections. The data belonging to each iSCSI session, block 1201, which is essentially a nexus of initiator and target connections, is carried on the appropriate connection, block 1202. Dependent commands are scheduled on the queues of the same connection to maintain the ordering of the commands, block 1203. However, unrelated commands can be assigned to different transport connections. It is possible to have all the commands be queued to the same connection, if the implementation supports only one connection per session. However, multiple connections per session are feasible to support line trunking between the initiator and the target. For example, in some applications, the initiator and the target will be in communication with each other and will decide through negotiation to accept multiple connections. In others, the initiator and target will communicate through only one session or connection. FIG. 13 and FIG. 14 illustrate the TCP/IP and iSCSI session data base or transmission control block per session and connection. These entries may be carried as separate tables or may be carried together as a composite table as seen subsequently with respect to FIGS. 23, 24, 26 and 29, depending on the implementation chosen and the functionality implemented, e.g. TCP/IP only, TCP/IP with RDMA, IP Storage only, IP Storage with TCP/IP, IP Storage with RDMA and the like. Various engines that perform TCP/IP and storage flow control use all or some of these fields, or more fields not shown, to direct the block data transfer over TCP/IP. The appropriate fields are updated as the connection progresses through the multiple states during the course of data transfer. FIG.
15 illustrates one method of storing the transmission control entries in a memory subsystem that consists of an on-chip session cache, blocks 1501 and 1502, and off-chip session memory, blocks 1503, 1504, 1505, 1506 and 1507, that retains the state information necessary for continuous progress of the data transfers.

FIG. 16 illustrates the IP processor architecture at a high level of abstraction. The processor consists of a modular and scalable IP network application processor core, block 1603. Its functional blocks provide the functionality for enabling high speed storage and data transport over IP networks. The processor core can include an intelligent flow controller, a programmable classification engine and a storage/network policy engine. Each can be considered an individual processor, or any combination of them can be implemented as a single processor. The disclosed processor also includes a security processing block to provide high line rate encryption and decryption functionality for the network packets. This, likewise, can be a single processor, or combined with the others mentioned above. The disclosed processor includes a memory subsystem, including a memory controller interface, which manages the on chip session cache/memory, and a memory controller, block 1602, which manages accesses to the off chip memory, which may be SRAM, DRAM, FLASH, ROM, EEPROM, DDR SDRAM, RDRAM, FCRAM, QDR SRAM, or other derivatives of static or dynamic random access memory or a combination thereof. The IP processor includes appropriate system interfaces to allow it to be used in the targeted market segments, providing the right media interfaces, block 1601, for LAN, SAN, WAN and MAN networks, and similar networks, and an appropriate host interface, block 1606. The media interface block and the host interface block may be in a multi-port form where some of the ports may serve the redundancy and fail-over functions in the networks and systems in which the disclosed processor is used. The processor also may contain the coprocessor interface block 1605, for extending the capabilities of the main processor, for example creating a multi-processor system.
The system controller interface of block 1604 allows this processor to interface with an off-the-shelf microcontroller that can act as the system controller for the system in which the disclosed processor may be used. The processor architecture also supports a control plane processor on board, which could act as the system controller or session manager. The system controller interface may still be provided to enable the use of an external processor. Such a version of this processor may not include the control processor for die cost reasons. There are various types of the core architecture that can be created, targeting specific system requirements, for example server adapters or storage controllers or switch line cards or other networking systems. The primary differences would be as discussed in the earlier sections of this patent. These processor blocks provide capabilities and performance to achieve high performance IP based storage using standard protocols like iSCSI, FCIP, iFCP and the like. The detailed architecture of these blocks will be discussed in the following description.

FIG. 17 illustrates the IP processor architecture in more detail. The architecture provides capabilities to process incoming IP packets from the media access control (MAC) layer, or other appropriate layer, through full TCP/IP termination and deep packet inspection. This block diagram does not show the MAC layer block 1601, or blocks 1602, 1604 or 1605 of FIG. 16. The MAC layer interface blocks that connect to the input queue, block 1701, and output queue, block 1712, of the processor are in the media interface, block 1601, shown in FIG. 16. The MAC functionality could be standards based, with the specific type dependent on the network. Ethernet and Packet over SONET are examples of the most widely used interfaces today, which may be included on the same silicon or a different version of the processor created with each.

The block diagram in FIG. 17 illustrates input queue and output queue blocks 1701 and 1712 as two separate blocks. The functionality may be provided using a combined block. The input queue block 1701 consists of the logic, control and storage to retrieve the incoming packets from the MAC interface block. Block 1701 queues the packets as they arrive from the interface and creates appropriate markers to identify start of the packet, end of the packet and other attributes like a fragmented packet or a secure packet, and the like, working with the packet scheduler 1702 and the classification engine 1703. The packet scheduler 1702 retrieves the packets from the input queue controller and passes them for classification to the classification engine. The classification block 1703 is shown to follow the scheduler; however, from a logical perspective the classification engine receives the packet from the input queue, classifies the packet and provides the classification tag to the packet, which is then scheduled by the scheduler to the processor array 1706(a) . . . 1706(n). Thus the classification engine can act as a pass-through classification engine, sustaining the flow of the packets through its structure at the full line rate. The classification engine is a programmable engine that classifies the packets received from the network in various categories and tags the packet with the classification result for the scheduler and the other packet processors to use. Classification of the network traffic is a very compute intensive activity which can take up to half of the processor cycles available in a packet processor. This integrated classification engine is programmable to perform Layer 2 through Layer 7 inspection. The fields to be classified are programmed in with expected values for comparison and the action associated with them if there is a match.
The classifier collects the classification walk results and can present these as a tag to the packet identifying the classification result as seen subsequently with respect to FIG. 30. This is much like a tree structure and is understood as a “walk.” The classified packets are then provided to the scheduler 1702 as the next phase of the processing pipeline.
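The idea of programming fields with expected values and associated actions, and collecting the matches into a tag, can be sketched as below. This is a deliberately flattened illustration: the text describes a tree-like walk, whereas the sketch uses a simple ordered rule list. The field names, action names and the `RULES` table are assumptions made for the example; port 3260 is the registered iSCSI port.

```python
# Each rule is (field name, expected value, action recorded on a match).
# All field and action names here are hypothetical.
RULES = [
    ("ip_proto", 6, "tcp"),
    ("dst_port", 3260, "iscsi"),
    ("encrypted", True, "to_security_engine"),
    ("fragmented", True, "to_fragment_store"),
]


def classify(packet, rules=RULES):
    """Compare the programmed fields with the packet and collect the actions."""
    tag = []
    for field, expected, action in rules:
        if packet.get(field) == expected:
            tag.append(action)
    return tag
```

For a packet described as `{"ip_proto": 6, "dst_port": 3260}`, the sketch yields the tag `["tcp", "iscsi"]`, which a scheduler could then use to pick an execution resource.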

The packet scheduler block 1702 includes a state controller and sequencer that assign packets to appropriate execution engines on the disclosed processor. The execution engines are the SAN packet processors, block 1706(a) through 1706(n), including the TCP/IP and/or storage engines as well as the storage flow/RDMA controller, block 1708, or host bypass and/or other appropriate processors, depending on the desired implementation. For clarity, the term “/”, when used to designate hardware components in this patent, can mean “and/or” as appropriate. For example, the component “storage flow/RDMA controller” can be a storage flow and RDMA controller, a storage flow controller, or an RDMA controller, as appropriate for the implementation. The scheduler 1702 also maintains the packet order through the processor where the state dependency from packet to packet on the same connection/session is important for correct processing of the incoming packets. The scheduler maintains various tables to track the progress of the scheduled packets through the processor until packet retirement. The scheduler also receives commands that need to be scheduled to the packet processors on the outgoing commands and packets from the host processor or switch fabric controller or interface.

The TCP/IP and storage engines along with programmable packet processors are together labeled as the SAN Packet Processors 1706(a) through 1706(n) in FIG. 17. These packet processors are engines that are independent programmable entities that serve a specific role. Alternatively, two or more of them can be implemented as a single processor depending on the desired implementation. The TCP/IP engine of FIG. 23 and the storage engines of FIG. 24 are configured in this example as coprocessors to the programmable packet processor engine block 2101 of FIG. 21. This architecture can thus be applied with relative ease to applications other than storage by substituting or removing the storage engine for reasons of cost, manufacturability, market segment and the like. In a pure networking environment the storage engine could be removed, leaving the packet processor with a dedicated TCP/IP engine to be applied to the networking traffic, which will face the same processing overhead from TCP/IP software stacks. Alternatively, one or more of the engines may be dropped for the desired implementation, e.g. a processor supporting only IP Storage functions may drop the TCP/IP engine and/or packet engine, which may be in a separate chip. Hence, multiple variations of the core scalable and modular architecture are possible. The core architecture can thus be leveraged in applications besides the storage over IP applications by substituting the storage engine with other dedicated engines, for example a high performance network security and policy engine, a high performance routing engine, a high performance network management engine, a deep packet inspection engine providing string search, an engine for XML, an engine for virtualization, and the like, providing support for application specific acceleration. The processing capability of this IP processor can be scaled by scaling the number of SAN Packet Processor blocks 1706(a) through 1706(n) in the chip to meet the line rate requirements of the network interface.
The primary limitation on scalability would come from the silicon real-estate required and the limits imposed by the silicon process technologies. Fundamentally this architecture is scalable to very high line rates by adding more SAN packet processor blocks, thereby increasing the processing capability. Another means of achieving a similar result is to increase the clock frequency of operation of the processor to that feasible within the process technology limits.

FIG. 17 also illustrates the IP session cache/memory and the memory controller block 1704. This cache can be viewed as an internal memory or local session database cache. This block is used to cache and store the TCP/IP session database and also the storage session database for a certain number of active sessions. The number of sessions that can be cached is a direct result of the chosen silicon real-estate and what is economically feasible to manufacture. The sessions that are not on chip are stored and retrieved to/from off chip memory, viewed as an external memory, using a high performance memory controller block which can be part of block 1704 or otherwise. Various processing elements of this processor share this controller using a high speed internal bus to store and retrieve the session information. The memory controller can also be used to temporarily store packets that may be fragmented or when the host interface or outbound queues are backed-up. The controller may also be used to store statistics information or any other information that may be collected by the disclosed processor or the applications running on the disclosed or host processor.
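The split between a bounded on-chip session cache and a larger off-chip session memory can be illustrated with the sketch below. The text does not state a replacement policy, so the least-recently-used eviction here is an assumption, as are the class name `SessionCache` and the use of a plain dictionary to stand in for external memory.

```python
from collections import OrderedDict


class SessionCache:
    """Bounded on-chip cache in front of an off-chip session store (illustrative)."""

    def __init__(self, capacity, off_chip):
        self.capacity = capacity
        self.on_chip = OrderedDict()   # session ID -> state, oldest entry first
        self.off_chip = off_chip       # dict standing in for external memory

    def lookup(self, session_id):
        if session_id in self.on_chip:
            self.on_chip.move_to_end(session_id)    # mark as recently used
            return self.on_chip[session_id]
        entry = self.off_chip.pop(session_id)       # KeyError if session unknown
        self._insert(session_id, entry)
        return entry

    def _insert(self, session_id, entry):
        if len(self.on_chip) >= self.capacity:
            victim, state = self.on_chip.popitem(last=False)
            self.off_chip[victim] = state           # write back the evicted session
        self.on_chip[session_id] = entry
```

A hit returns the cached state without touching the external store; a miss retrieves the entry and, if the cache is full, writes the least recently used session back off chip.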

The processor block diagram of FIG. 17 also illustrates host interface block 1710, host input queue, block 1707, and host output queue, block 1709, as well as the storage flow/RDMA controller, block 1708. These blocks provide the functions that are required to transfer data to and from the host (also called “peer”) memory or switch fabric. These blocks also provide features that allow the host based drivers to schedule the commands, retrieve incoming status, retrieve the session database entry, program the disclosed processor, and the like, to enable capabilities like sockets direct architecture, full TCP/IP termination, IP storage offload and similar capabilities, with or without using RDMA. The host interface controller 1710, seen in greater detail in FIG. 27, provides the configuration registers, DMA engines for direct memory to memory data transfer, the host command block that performs some of the above tasks, along with the host interface transaction controller and the host interrupt controller. The host input and output queues 1707, 1709 provide the queuing for incoming and outgoing packets. The storage flow and RDMA controller block 1708 provides the functionality necessary for the host to queue the commands to the disclosed processor, which then takes these commands and executes them, interrupting the host processor on command termination. The RDMA controller portion of block 1708 provides various capabilities necessary for enabling remote direct memory access. It has tables that include information such as RDMA region, access keys, and virtual address translation functionality. The RDMA engine inside this block performs the data transfer and interprets the received RDMA commands to perform the transaction if the transaction is allowed. The storage flow controller of block 1708 also keeps track of the state of the progress of various commands that have been scheduled as the data transfer happens between the target and the initiator.
The storage flow controller schedules the commands for execution and also provides the command completion information to the host drivers. The above can be considered RDMA capability and can be implemented as described or by implementing as individual processors, depending on the designer's choice. Also, additional functions can be added to or removed from those described without departing from the spirit or the scope of this patent.

The control plane processor block 1711 of this processor is used to provide relatively slow path functionality for TCP/IP and/or storage protocols, which may include error processing with ICMP protocol, name resolution and address resolution protocol, and it may also be programmed to perform session initiation/teardown acting as a session controller/connection manager, login and parameter exchange, and the like. This control plane processor could be off chip to provide the system developer a choice of the control plane processor, or may be on chip to provide an integrated solution. If the control plane processor is off-chip, then an interface block would be created or integrated herein that would allow this processor to interface with the control plane processor and perform data and command transfers. The internal bus structures and functional block interconnections may be different than illustrated for all the detailed figures for performance, die cost requirements and the like and not depart from the spirit and the scope of this patent.

The capabilities described above for the FIG. 17 blocks, and in more detail below, enable a packet streaming architecture that allows packets to pass through from input to output with minimal latency, with in-stream processing by various processing resources of the disclosed processor.

FIG. 18 illustrates the input queue and controller block shown generally at 1701 of FIG. 17 in more detail. The core functionality of this block is to accept the incoming packets from multiple input ports, Ports 1 to N, in blocks 1801 and 1802(i) to 1802(n), and to queue them using a fixed or programmable priority on the input packet queue, block 1810, from where the packets get de-queued for classifier, scheduler and further packet processing through scheduler I/F blocks 1807-1814. The input queue controller interfaces with each of the input ports (Port 1 through Port N in a multi-port implementation), and queues the packets to the input packet queue 1810. The packet en-queue controller and marker block 1804 may provide fixed priority functions or may be programmable to allow different policies to be applied to different interfaces based on various characteristics like port speed, the network interface of the port, the port priority and others that may be appropriate. Various modes of priority may be programmable, like round-robin, weighted round-robin or others. The input packet de-queue controller 1812 de-queues the packets and provides them to the packet scheduler, block 1702 of FIG. 17, via scheduler I/F 1814. The scheduler schedules the packets to the SAN packet processors 1706(a)-1706(n) once the packets have been classified by the classification engine 1703 of FIG. 17. The encrypted packets can be classified as encrypted first and passed on to the security engine 1705 of FIG. 17 by the secure packet interface block 1813 of FIG. 18 for authentication and/or decryption if the implementation includes security processing; otherwise the security interfaces may not be present and an external security processor would be used to perform similar functions. The decrypted packets from the clear packet interface, block 1811, are then provided to the input queue through block 1812, from which the packet follows the same route as a clear packet.
The fragmented IP packets may be stored on-chip in the fragmented packet store and controller buffers, block 1806, or may be stored in the internal or external memory. When the last fragment arrives, the fragment controller of block 1806, working with the classification engine and the scheduler of FIG. 17, merges these fragments to assemble the complete packet. Once the fragmented packet is combined to form a complete packet, the packet is scheduled into the input packet queue via block 1804 and is then processed by the packet de-queue controller, block 1812, to be passed on to various other processing stages of this processor. The input queue controller of FIG. 18 assigns a packet tag/descriptor to each incoming packet, which is managed by the attribute manager of block 1809, which uses the packet descriptor fields like the packet start, size and buffer address, along with any other security information from the classification engine, and is stored in the packet attributes and tag array of block 1808. The packet tag and attributes are used to control the flow of the packet through the processor by the scheduler and other elements of the processor in an efficient manner through interfaces 1807, 1811, 1813 and 1814.
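Of the programmable priority modes mentioned for the input queue controller, weighted round-robin is easy to illustrate in software. The sketch below merges per-port queues into one input packet queue, taking up to a port's weight in packets on each round. The function name, the dictionary layout and the treatment of missing or non-positive weights as 1 are assumptions for the example.

```python
from collections import deque


def weighted_round_robin(port_queues, weights):
    """Merge per-port queues, taking up to weights[port] packets per port per round."""
    merged = []
    while any(port_queues.values()):          # an empty deque is falsy
        for port, queue in port_queues.items():
            # Missing or non-positive weights are treated as 1 so every port drains.
            quota = max(1, weights.get(port, 1))
            for _ in range(quota):
                if not queue:
                    break
                merged.append(queue.popleft())
    return merged
```

With weights of 2 for one port and 1 for another, the first port contributes two packets for every one from the second while both have traffic, and a port with nothing queued is simply skipped.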

FIG. 19 illustrates the packet scheduler and sequencer 1702 of FIG. 17 in more detail. This block is responsible for scheduling packets and tasks to the execution resources of this processor and thus also acts as a load balancer. The scheduler retrieves the packet headers from the header queue, block 1902, from the input queue controller 1901 to pass them to the classification engine 1703 of FIG. 17, which returns the classification results to the classifier queue, block 1909, that are then used by the rest of the processor engines. The classification engine may be presented primarily with the headers, but if deep packet inspection is also programmed, the classification engine may receive the complete packets, which it routes to the scheduler after classification. The scheduler comprises a classification controller/scheduler, block 1908, which manages the execution of the packets through the classification engine. This block 1908 of FIG. 19 provides the commands to the input queue controller, block 1901, in case of fragmented packets or secure packets, to perform the appropriate actions for such packets, e.g. schedule an encrypted packet to the security engine of FIG. 17. The scheduler state control and the sequencer, block 1916, receive state information of various transactions/operations active inside the processor and provide instructions for the next set of operations. For instance, the scheduler retrieves the packets from the input packet queue of block 1903, and schedules these packets in the appropriate resource queue depending on the results of the classification received from the classifier, or directs the packet to the packet memory, block 1913 or 1704 through 1906, creating a packet descriptor/tag which may be used to retrieve the packet when the appropriate resource needs it to perform its operations at or after scheduling.
The state control and sequencer block 1916 instructs/directs the packets with their classification result, block 1914, to be stored in the packet memory, block 1913, from where the packets get retrieved when they are scheduled for operation. The state controller and the sequencer identify the execution resource that should receive the packet for operation, create a command and assign this command with the packet tag to the resource queues, blocks 1917 (Control Plane), 1918 (port i-port n), 1919 (bypass) and 1920 (host) of FIG. 19. The priority selector 1921 is a programmable block that retrieves the commands and the packet tag from the respective queues based on the assigned priority and passes these to the packet fetch and command controller, block 1922. This block retrieves the packet from the packet memory store 1913 along with the classification results and schedules the packet transfer to the appropriate resource on the high performance processor command and packet busses such as at 1926 when the resource is ready for operation. The bus interface blocks, like command bus interface controller 1905, of the respective recipients interpret the command and accept the packet and the classification tag for operation. These execution engines inform the scheduler when the packet operation is complete and when the packet is scheduled for its end destination (either the host bus interface, or the output interface or control plane interface, etc.). This allows the scheduler to retire the packet from its state with the help of the retirement engine of block 1904 and frees up the resource entry for this session in the resource allocation table, block 1923. The resource allocation table is used by the sequencer to assign the received packets to specific resources, depending on the current internal state of these resources, e.g. the session database cache entry buffered in the SAN packet processor engine, the connection ID of the current packet being executed in the resource, and the like. 
Thus packets that are dependent on an ordered execution get assigned primarily to the same resource, which improves memory traffic and performance by using the current DB state in the session memory in the processor and not having to retrieve new session entries. The sequencer also has an interface to the memory controller, block 1906, for queuing of packets that are fragmented packets and/or for the case in which the scheduler queues get backed up due to a packet processing bottleneck downstream, which may be caused by specific applications that are executed on packets and take more time than that allocated to maintain a full line rate performance, or for the case in which any other downstream systems get full and are unable to sustain the line rate.
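The interplay of the resource allocation table and the priority selector can be illustrated with a small software model. It is a sketch under stated assumptions: the `Scheduler` class, the least-loaded fallback for new connections, and the particular priority values are hypothetical (the priority selector is described only as programmable), and a single heap stands in for the separate resource queues.

```python
import heapq
import itertools


class Scheduler:
    """Toy model of session-affine scheduling with a priority selector."""

    def __init__(self, resources, queue_priority):
        self.load = {r: 0 for r in resources}   # outstanding packets per resource
        self.alloc = {}                         # connection id -> resource
        self.queue_priority = queue_priority    # queue name -> priority (lower first)
        self.heap = []
        self.order = itertools.count()          # tie-breaker keeps FIFO order

    def schedule(self, packet_tag, conn_id, queue):
        # Reuse the resource already holding this connection's session state.
        resource = self.alloc.get(conn_id)
        if resource is None:
            resource = min(self.load, key=self.load.get)  # least loaded (assumed)
            self.alloc[conn_id] = resource
        self.load[resource] += 1
        entry = (self.queue_priority[queue], next(self.order), packet_tag, resource)
        heapq.heappush(self.heap, entry)
        return resource

    def next_command(self):
        """Priority selector: pop the highest-priority pending command."""
        _, _, packet_tag, resource = heapq.heappop(self.heap)
        return packet_tag, resource

    def retire(self, conn_id, resource, session_done=False):
        """Called when an engine reports completion of a packet."""
        self.load[resource] -= 1
        if session_done:
            self.alloc.pop(conn_id, None)  # free the allocation table entry
```

Keeping packets of one connection on one resource is what lets that resource reuse its locally buffered session entry instead of fetching it again.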

If the classifier is implemented before the scheduler as discussed above with respect to FIG. 17 where the classification engine receives the packet from the input queue, items 1901, 1902, 1908, 1909 and 1910 would be in the classifier, or may not be needed, depending on the particular design. The appropriate coupling from the classifier to/from the scheduler blocks 1903, 1907, 1914 and 1915 may be created in such a scenario and the classifier coupled directly to the input queue block of FIG. 18.

FIG. 20 illustrates the packet classification engine shown generally at 1703 of FIG. 17. Classification of the packets into their various attributes is a very compute intensive operation. The classifier can be a programmable processor that examines various fields of the received packet to identify the type of the packet, the protocol type, e.g. IP, ICMP, TCP, UDP etc., the port addresses, the source and destination fields, etc. The classifier can be used to test a particular field or a set of fields in the header or the payload. The block diagram illustrates a content addressable memory based classifier. However, as discussed earlier this could be a programmable processor as well. The primary differences are the performance and complexity of implementation of the engine. The classifier gets the input packets through the scheduler from the input queues, blocks 2005 and 2004 of FIG. 20. The input buffers 2004 queue the packets/descriptor and/or the packet headers that need to be classified. Then the classification sequencer 2003 fetches the next available packet in the queue and extracts the appropriate packet fields based on the global field descriptor sets, block 2007, which are, or can be, programmed. Then the classifier passes these fields to the content addressable memory (CAM) array, block 2009, to perform the classification. As the fields are passed through the CAM array, the match of these fields identifies the next set of fields to be compared and potentially their bit field location. The match in the CAM array results in the action/event tag, which is collected by the result compiler, block 2014 (where “compiling” is used in the sense of “collecting”), and also acted on as an action that may require updating the data in the memory array, block 2013, associated with the specific CAM condition or rule match. This may include performing an arithmetic logic unit (ALU) operation (block 2017, which can be considered one example of an execution resource) on this field, e.g. 
increment or decrement the condition match and the like. The CAM arrays are programmed with the fields, their expected values and the action on match, including the next field to compare, through the database initialization block 2011, accessible for programming through the host or the control plane processor interfaces 1710, 1711. Once the classification reaches a leaf node the classification is complete and the classification tag is generated that identifies the path traversed, which can then be used by other engines of the IP processor to avoid performing the same classification tasks. For example a classification tag may include the flow or session ID, protocol type indication, e.g. TCP/UDP/ICMP etc., a value indicating whether to process, bypass, drop packet, drop session, and the like, or may also include the specific firmware code routine pointer for the execution resource to start packet processing or may include a signature of the classification path traversed or the like. The classification tag fields are chosen based on processor implementation and functionality. The classifier retirement queue, block 2015, holds the packets/descriptors of packets that are classified, together with their classification tags, and are waiting to be retrieved by the scheduler. The classification database can be extended using the database extension interface and pipeline control logic block 2006. This allows systems that need extensibility for a larger classification database to be built. The classification engine with the action interpreter, the ALU and range matching block of 2012 also provides capabilities to program storage/network policies/actions that need to be taken if certain policies are met. The policies can be implemented in the form of rule and action tables. The policies get compiled and programmed in the classification engine through the host interface along with the classification tables. 
The database interface and pipeline control 2006 could be implemented to couple to a companion processor to extend the size of the classification/policy engine.
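The table walk described above, in which each match selects the next field to compare until a leaf is reached, can be modeled in software as follows. This is a hedged sketch: a Python dictionary stands in for the CAM array, and the node names, action names, default action and the example rules (EtherType 0x0800 for IPv4, IP protocol 6 for TCP and 17 for UDP, port 3260 for iSCSI) are illustrative choices rather than the programmed contents of any actual classifier.

```python
# Each node names the header field to compare; each CAM entry maps a
# (node, field value) pair to an action tag and the next node (None = leaf).
NODE_FIELD = {"root": "ethertype", "ip": "protocol", "tcp": "dst_port"}

CAM = {
    ("root", 0x0800): ("is_ip", "ip"),
    ("ip", 6): ("is_tcp", "tcp"),
    ("ip", 17): ("is_udp", None),
    ("tcp", 3260): ("iscsi", None),
}

DEFAULT = ("bypass", None)  # assumed action when no entry matches


def classify(header):
    """Walk the table until a leaf is reached; return the classification tag."""
    node, path = "root", []
    while node is not None:
        value = header.get(NODE_FIELD[node])
        action, node = CAM.get((node, value), DEFAULT)
        path.append(action)
    return {"action": path[-1], "path": tuple(path)}
```

The returned `path` plays the role of the signature of the classification path traversed, so a downstream engine can act on the tag without repeating the comparisons.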

FIG. 21 illustrates the SAN Packet Processor shown generally at 1706(a) through 1706(n) of FIG. 17. A packet processor can be a specially designed packet processor, or it can be any suitable processor such as an ARM, ARC, Tensilica, MIPS, StrongARM, X86, PowerPC, Pentium processor, iA64 or any other processor that serves the functions described herein. This is also referred to as the packet processor complex in various sections of this patent. This packet processor comprises a packet engine, block 2101, which is generally a RISC or VLIW machine with target instructions for packet processing, or a TCP/IP engine, block 2102, or an IP storage engine, block 2103, or a combination thereof. These engines can be configured as coprocessors to the packet engine or can be independent engines. FIG. 22 illustrates the packet engine in more detail. The packet engine is generally a RISC or VLIW machine as indicated above with instruction memory, block 2202, and data memory, block 2206 (both of which can be RAM), that are used to hold the packet processing micro routines and the packets and intermediate storage. The instruction memory 2202 which, like all such memory in this patent, can be RAM or other suitable storage, is initialized with the code that is executed during packet processing. The packet processing code is organized as tight micro routines that fit within the allocated memory. The instruction decoder and the sequencer, block 2204, fetches the instructions from instruction memory 2202, decodes them and sequences them through the execution blocks contained within the ALU, block 2208. This machine can be a simple pipelined engine or a more complex deep pipelined machine that may also be designed to provide a packet oriented instruction set. The DMA engine, block 2205, and the bus controller, block 2201, allow the packet engine to move the data packets from the scheduler of FIG. 19 and the host interface into the data memory 2206 for operation. 
The DMA engine may hold multiple memory descriptors to store/retrieve packet/data to/from host memory/packet memory. This would enable memory accesses to happen in parallel to packet processor engine operations. The DMA engine 2205 also may be used to move the data packets to and from the TCP and storage engines 2210, 2211. Once the execution of the packet is complete, the extracted data or newly generated packet is transferred to the output interface either towards the media interface or the host interface.

FIG. 23 illustrates a programmable TCP/IP packet processor engine, seen generally at 2210 of FIG. 22, in more detail. This engine is generally a programmable processor with common RISC or VLIW instructions along with various TCP/IP oriented instructions and execution engines but could also be a micro-coded or a state machine driven processor with appropriate execution engines described in this patent. The TCP processor includes a checksum block, 2311, for TCP checksum verification and new checksum generation by executing these instructions on the processor. The checksum block extracts the data packet from the packet buffer memory (a Data RAM is one example of such memory), 2309, and performs the checksum generation or verification. The packet look-up interface block, 2310, assists the execution engines and the instruction sequencer, 2305, providing access to various data packet fields or the full data packet. The classification tag interpreter, 2313, is used by the instruction decoder 2304 to direct the program flow based on the results of the classification if such an implementation is chosen. The processor provides specific sequence and windowing operations including segmentation, block 2315, for use in the TCP/IP data sequencing calculations, for example to look up the next expected sequence number and check whether the received sequence number is within the agreed upon sliding window, which is a well known part of the TCP protocol, for the connection to which the packet belongs. This element 2315 may also include a segmentation controller like that shown at 2413 of FIG. 24. Alternatively, one of ordinary skill in the art, with the teaching of this patent, can easily implement the segmentation controllers elsewhere on the TCP/IP processor of this FIG. 23. The processor provides a hash engine, block 2317, which is used to perform hash operations against specific fields of the packet to perform a hash table walk that may be required to get the right session entry for the packet. 
The processor also includes a register file, block 2316, which extracts various commonly used header fields for TCP processing, along with pointer registers for data source and destination, context register sets, and registers that hold the TCP states along with a general purpose register file. The TCP/IP processor can have multiple contexts for packet execution, so that when a given packet execution stalls for any reason, for example memory access, the other context can be woken up and the processor can continue the execution of another packet stream with little efficiency loss. The TCP/IP processor engine also maintains a local session cache, block 2320, which holds the most recently used or most frequently used entries, which can be used locally without needing to retrieve them from the global session memory. The local session cache can be considered an internal memory of the TCP/IP processor, which can be a packet processor. Of course, the more of the needed entries that can be stored locally in the internal memory, without retrieving additional ones from the session, or global, memory, the more efficient the processing will be. The packet scheduler of FIG. 19 is informed of the connection IDs that are cached per TCP/IP processor resource, so that it can schedule the packets that belong to the same session to the same packet processor complex. When the packet processor does not hold the session entry for the specific connection, then the TCP session database lookup engine, block 2319, working with the session manager, block 2321, and the hash engine retrieves the corresponding entry from the global session memory through the memory controller interface, block 2323. There are means, such as logic circuitry inside the session manager that allow access of session entries or fields of session entries, that act with the hash engine to generate the session identifier for storing/retrieving the corresponding session entry or its fields to the session database cache. 
This can be used to update those fields or entries as a result of packet processing. When a new entry is fetched, the entry which it is replacing is stored to the global session memory. The local session caches may follow exclusivity caching principles, so that multiple processor complexes do not cause any race conditions, damaging the state of the session. Other caching protocols like the MESI protocol may also be used to achieve similar results. When a session entry is cached in a processor complex, and another processor complex needs that entry, this entry is transferred to the new processor with exclusive access or appropriate caching state based on the algorithm. The session entry may also get written to the global session memory in certain cases. The TCP/IP processor also includes a TCP state machine, block 2322, which is used to walk through the TCP states for the connection being operated on. This state machine receives the state information stored in the session entry along with the appropriate fields affecting the state from the newly received packet. This allows the state machine to generate the next state if there is a state transition and the information is updated in the session table entry. The TCP/IP processor also includes a frame controller/out of order manager block, 2318, that is used to extract the frame information and perform operations for out of order packet execution. This block could also include an RDMA mechanism such as that shown at 2417 of FIG. 24, but used for non-storage data transfers. One of ordinary skill in the art can also, with the teaching of this patent, implement an RDMA mechanism elsewhere on the TCP/IP processor. This architecture creates an upper layer framing mechanism which may use packet CRC as framing key or other keys that are used by the programmable frame controller to extract the embedded PDUs even when the packets arrive out of order and allow them to be directed to the end buffer destination. 
This unit interacts with the session database to handle out of order arrival information which is recorded so that once the intermediate segments arrive, the retransmissions are avoided. Once the packet has been processed through the TCP/IP processor, it is delivered for operation to the storage engine, if the packet belongs to a storage data transfer and the specific implementation includes a storage engine, otherwise the packet is passed on to the host processor interface or the storage flow/RDMA controller of block 1708 for processing and for DMA to the end buffer destination. The packet may be transferred to the packet processor block as well for any additional processing on the packet. This may include application and customer specific application code that can be executed on the packet before or after the processing by the TCP/IP processor and the storage processor. Data transfer from the host to the output media interface would also go through the TCP/IP processor to form the appropriate headers to be created around the data and also perform the appropriate data segmentation, working with the frame controller and/or the storage processor as well as to update the session state. This data may be retrieved as a result of host command or received network packet scheduled by the scheduler to the packet processor for operation. The internal bus structures and functional block interconnections may be different than illustrated for performance, die cost requirements and the like. 
For example, Host Controller Interface 2301, Scheduler Interface 2307 and Memory Controller Interface 2323 may be part of a bus controller that allows transfer of data packets or state information or commands, or a combination thereof, to or from a scheduler or storage flow/RDMA controller or host or session controller or other resources such as, without limitation, security processor, or media interface units, host interface, scheduler, classification processor, packet buffers or controller processor, or any combination of the foregoing.
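The behavior of the local session cache in front of the global session memory, including the write-back of a replaced entry, can be sketched as below. The sketch assumes a least-recently-used replacement policy and models the global session memory as a plain dictionary; the class `LocalSessionCache` and its fields are hypothetical, and the coherence between several processor complexes (exclusivity or MESI-style states) is deliberately left out.

```python
from collections import OrderedDict


class LocalSessionCache:
    """LRU cache of session entries in front of a global session memory."""

    def __init__(self, global_memory, capacity):
        self.global_memory = global_memory  # dict standing in for global memory
        self.capacity = capacity
        self.entries = OrderedDict()        # connection id -> session entry
        self.misses = 0

    def lookup(self, conn_id):
        if conn_id in self.entries:
            self.entries.move_to_end(conn_id)   # mark as most recently used
            return self.entries[conn_id]
        self.misses += 1
        if len(self.entries) >= self.capacity:
            # Write the replaced entry back before reusing its slot.
            victim_id, victim = self.entries.popitem(last=False)
            self.global_memory[victim_id] = victim
        # Fetch a private copy; a KeyError here means the session is unknown.
        entry = dict(self.global_memory[conn_id])
        self.entries[conn_id] = entry
        return entry
```

A scheduler that steers packets of the same connection to the same processor keeps `misses` low, which is the efficiency argument made in the text.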

FIG. 24 illustrates the IP storage processor engine of FIG. 22 in more detail. The storage engine is a programmable engine with an instruction set that is geared towards IP based storage along with, usually, a normal RISC or VLIW-like packet processing instruction set. The IP storage processor engine contains block 2411, to perform CRC operations. This block allows CRC generation and verification. The incoming packet with IP storage is transferred from the TCP/IP engine through DMA, blocks 2402 and 2408, into the data memory (a data RAM is an example of such memory), block 2409. When the implementation does not include a TCP/IP engine or packet processor engine or a combination thereof, the packet may be received from the scheduler directly, for example. The TCP session database information related to the connection can be retrieved from the local session cache as needed or can also be received with the packet from the TCP/IP engine. The storage PDU is provided to the PDU classifier engine, block 2418, which classifies the PDU into the appropriate command, which is then used to invoke the appropriate storage command execution engine, block 2412. The command execution can be accomplished using the RISC or VLIW, or equivalent, instruction set or using a dedicated hardware engine. The command execution engines perform the command received in the PDU. The received PDU may contain read command data, or R2T for a pending write command or other commands required by the IP storage protocol. These engines retrieve the write data from the host interface or direct the read data to the destination buffer. The storage session database entry is cached locally, in what can be viewed as a local memory, block 2420, for the recent or frequent connections served by the processor. The command execution engines execute the commands and make the storage database entry updates working with the storage state machine, block 2422, and the session manager, block 2421. 
The connection ID is used to identify the session, and if the session is not present in the cache, then it is retrieved from the global session memory 1704 of FIG. 17 by the storage session look-up engine, block 2419. For data transfer from the initiator to target, the processor uses the segmentation controller, block 2413, to segment the data units into segments as per various network constraints like path MTU and the like. The segmentation controller attempts to ensure that the outgoing PDUs are optimal size for the connection. If the data transfer requested is larger than the maximum effective segment size, then the segmentation controller packs the data into multiple packets and works with the sequence manager, block 2415, to assign the sequence numbers appropriately. The segmentation controller 2413 may also be implemented within the TCP/IP processor of FIG. 23. That is, the segmentation controller may be part of the sequence/window operations manager 2315 of FIG. 23 when this processor is used for TCP/IP operations and not storage operations. One of ordinary skill in the art can easily suggest alternate embodiments for including the segmentation controller in the TCP/IP processor using the teachings of this patent. The storage processor of FIG. 24 (or the TCP/IP processor of FIG. 23) can also include an RDMA engine that interprets the remote direct memory access instructions received in the PDUs for storage or network data transfers that are implemented using this RDMA mechanism. In FIG. 24, for example, this is RDMA engine 2417. In the TCP/IP processor of FIG. 23 an RDMA engine could be part of the frame controller and out of order manager 2318, or other suitable component. If both ends of the connection agree to the RDMA mode of data transfer, then the RDMA engine is utilized to schedule the data transfers between the target and initiator without substantial host intervention. The RDMA transfer state is maintained in a session database entry. 
This block creates the RDMA headers to be layered around the data, and is also used to extract these headers from the packets that are received on RDMA enabled connections. The RDMA engine works with the storage flow/RDMA controller, 1708, and the host interface controller, 1710, by passing the messages/instructions and performs the large block data transfers without substantial host intervention. The RDMA engine of the storage flow/RDMA controller block, 1708, of the IP processor performs protection checks for the operations requested and also provides conversion from the RDMA region identifiers to the physical or virtual address in the host space. This functionality may also be provided by RDMA engine, block 2417, of the storage engine of the SAN packet processor based on the implementation chosen. The distribution of the RDMA capability between 2417 and 1708 and other similar engines is an implementation choice that one with ordinary skill in the art will be able to make with the teachings of this patent. Outgoing data is packaged into standards based PDU by the PDU creator, block 2425. The PDU formatting may also be accomplished by using the packet processing instructions. The storage engine of FIG. 24 works with the TCP/IP engine of FIG. 23 and the packet processor engine of FIG. 17 to perform the IP storage operations involving data and command transfers in both directions, i.e. from the initiator to target and the target to the host and vice versa. That is, the Host Controller Interfaces 2401, 2407 store and retrieve commands or data or a combination thereof to or from the host processor. These interfaces may be directly connected to the host or may be connected through an intermediate connection. Though shown as two apparatus, interfaces 2401 and 2407 could be implemented as a single apparatus. The flow of data through these blocks would be different based on the direction of the transfer. 
For instance, when command or data is being sent from the host to the target, the storage processing engines will be invoked first to format the PDU and then this PDU is passed on to the TCP processor to package the PDU in a valid TCP/IP segment. However, a received packet will go through the TCP/IP engine before being scheduled for the storage processor engine. The internal bus structures and functional block interconnections may be different than illustrated for performance, die cost requirements, and the like. For example, and similarly to FIG. 23, Host Controller Interface 2401, 2407 and Memory Controller Interface 2423 may be part of a bus controller that allows transfer of data packets or state information or commands, or a combination thereof, to or from a scheduler or host or storage flow/RDMA controller or session controller or other resources such as, without limitation, security processor, or media interface units, host interface, scheduler, classification processor, packet buffers or controller processor, or any combination of the foregoing.
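The cooperation of the segmentation controller and the sequence manager can be illustrated with a short function. This is a sketch only: the function name `segment` and its parameters are hypothetical, the maximum segment size is taken as a given input rather than derived from the path MTU, and the only protocol fact relied on is that TCP sequence numbers count bytes and wrap modulo 2**32.

```python
def segment(data, start_seq, max_segment_size):
    """Split data into (sequence number, payload) pairs no larger than the limit."""
    if max_segment_size <= 0:
        raise ValueError("max_segment_size must be positive")
    segments = []
    for offset in range(0, len(data), max_segment_size):
        payload = data[offset:offset + max_segment_size]
        # TCP sequence numbers count bytes and wrap modulo 2**32.
        seq = (start_seq + offset) % 2**32
        segments.append((seq, payload))
    return segments
```

For example, a 3000-byte transfer with a 1460-byte limit yields three segments, and a starting sequence number close to 2**32 simply wraps around for the later segments.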

In applications in which storage is done on a chip not including the TCP/IP processor of FIG. 23 by, as one example, an IP Storage processor such as an iSCSI processor of FIG. 24, the TCP/IP Interface 2406 would function as an interface to a scheduler for scheduling IP storage packet processing by the IP Storage processor. Similar variations are well within the knowledge of one of ordinary skill in the art, viewing the disclosure of this patent.

FIG. 25 illustrates the output queue controller block 1712 of FIG. 17 in more detail. This block receives the packets that need to be sent on to the network media independent interface 1601 of FIG. 16. The packets may be tagged to indicate if they need to be encrypted before being sent out. The controller queues the packets that need to be secured to the security engine through the queue 2511 and security engine interface 2510. The encrypted packets are received from the security engine and are queued in block 2509, to be sent to their destination. The output queue controller may assign packets onto their respective quality of service (QOS) queues, if such a mechanism is supported. The programmable packet priority selector, block 2504, selects the next packet to be sent and schedules the packet for the appropriate port, Port1 . . . PortN. The media controller block 1601 associated with the port accepts the packets and sends them to their destination.

FIG. 26 illustrates the storage flow controller/RDMA controller block, shown generally at 1708 of FIG. 17, in more detail. The storage flow and RDMA controller block provides the functionality necessary for the host to queue the commands (storage or RDMA or sockets direct or a combination thereof) to this processor, which then takes these commands and executes them, interrupting the host processor primarily on command termination. The command queues, new and active, blocks 2611 and 2610, and completion queue, block 2612, can be partially on chip and partially in a host memory region or memory associated with the IP processor, from which the commands are fetched or the completion status deposited. The RDMA engine, block 2602, provides various capabilities necessary for enabling remote direct memory access. It has tables, like RDMA look-up table 2608, that include information like RDMA region and the access keys, and virtual address translation functionality. The RDMA engine inside this block 2602 performs the data transfer and interprets the received RDMA commands to perform the transaction if allowed. The storage flow controller also keeps track of the state of the progress of various commands that have been scheduled as the data transfer happens between the target and the initiator. The storage flow controller schedules the commands for execution and also provides the command completion information to the host drivers. The storage flow controller provides command queues where new requests from the host are deposited, while active commands are held in the active commands queue. The command scheduler of block 2601 assigns newly received commands for targets for which no connections exist to the scheduler for initiating a new connection. The scheduler 1702 uses the control plane processor shown generally at 1711 of FIG. 17 to do the connection establishment, at which point the connection entry is moved to the session cache, shown generally in FIG. 15 and 1704 in FIG. 
17, and the state controller in the storage flow controller block 2601 moves the new command to active commands and associates the command to the appropriate connection. The active commands, in block 2610, are retrieved and sent to the scheduler, block 1702, for operation by the packet processors. The update to the command status is provided back to the flow controller which then stores it in the command state tables, block 2607, accessed through block 2603. The sequencer of 2601 applies a programmable priority for command scheduling and thus selects the next command to be scheduled from the active commands and new commands. The flow controller also includes a new requests queue for incoming commands, block 2613. The new requests are transferred to the active command queue once the appropriate processing and buffer reservations are done on the host by the host driver. As the commands are being scheduled for execution, the state controller 2601 initiates data pre-fetch by the host data pre-fetch manager, block 2617, from the host memory using the DMA engine of the host interface block 2707, hence keeping the data ready to be provided to the packet processor complex when the command is being executed. The output queue controller, block 2616, enables the data transfer, working with the host controller interface, block 2614. The storage flow/RDMA controller maintains a target-initiator table, block 2609, that associates the target/initiators that have been resolved and connections established for fast look-ups and for associating commands to active connections. The command sequencer may also work with the RDMA engine 2602, if the commands being executed are RDMA commands or if the storage transfers were negotiated to be done through the RDMA mechanism at the connection initiation. The RDMA engine 2602, as discussed above, provides functionality to accept multiple RDMA regions, access control keys and the virtual address translation pointers. 
The host application (which may be a user application or an OS kernel function, storage or non-storage such as downloading web pages, video files, or the like) registers a memory region that it wishes to use in RDMA transactions with the disclosed processor through the services provided by the associated host driver. Once this is done, the host application communicates this information to its peer on a remote end. Now, the remote machine or the host can execute RDMA commands, which are served by the RDMA blocks on both ends without requiring substantial host intervention. The RDMA transfers may include operations like a read from a region of a certain number of bytes with a specific offset, or a write with similar attributes. The RDMA mechanism may also include send functionality which would be useful in creating communication pipes between two end nodes. These features are useful in clustering applications where large amounts of data transfer are required between buffers of two applications running on servers in a cluster, or more likely, on servers in two different clusters of servers, or such other clustered systems. The storage data transfer may also be accomplished using the RDMA mechanism, since it allows large blocks of data transfers without substantial host intervention. The hosts on both ends get initially involved to agree on doing the RDMA transfers and allocating memory regions and permissions through access control keys that get shared. Then the data transfer between the two nodes can continue without host processor intervention, as long as the available buffer space and buffer transfer credits are maintained by the two end nodes. The storage data transfer protocols would run on top of RDMA, by agreeing to use RDMA protocol and enabling it on both ends. The storage flow controller and RDMA controller of FIG. 26 can then perform the storage command execution and the data transfer using RDMA commands. 
As the expected data transfers are completed the storage command completion status is communicated to the host using the completion queue 2612. The incoming data packets arriving from the network are processed by the packet processor complex of FIG. 17 and then the PDU is extracted and presented to the flow controller of FIG. 26 in case of storage/RDMA data packets. These are then assigned to the incoming queue block 2604, and transferred to the end destination buffers by looking up the memory descriptors of the receiving buffers and then performing the DMA using the DMA engine inside the host interface block 2707. The RDMA commands may also go through protection key look-up and address translation as per the RDMA initialization.
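The region registration, protection key look-up and address translation described above can be summarized in a small model. Everything named here is an assumption for illustration: the `Region` fields, the `RdmaTable` class, and the use of `PermissionError` to signal a failed protection check are not taken from the disclosure, which does not specify a table layout or an error mechanism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    base: int        # host address the region maps to
    length: int      # size in bytes
    key: int         # access key shared with the peer
    writable: bool


class RdmaTable:
    """Hypothetical look-up table in the spirit of block 2608."""

    def __init__(self):
        self.regions = {}

    def register(self, region_id, region):
        self.regions[region_id] = region

    def translate(self, region_id, key, offset, nbytes, write):
        """Return the host address for the access, or raise PermissionError."""
        region = self.regions.get(region_id)
        if region is None or region.key != key:
            raise PermissionError("unknown region or bad access key")
        if write and not region.writable:
            raise PermissionError("region is read-only")
        if offset < 0 or nbytes < 0 or offset + nbytes > region.length:
            raise PermissionError("access outside the registered region")
        return region.base + offset
```

Only after `translate` succeeds would a DMA to or from the returned host address be issued, which is how the transfer can proceed without involving the host processor on each operation.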

The foregoing may also be considered a part of an RDMA capability or an RDMA mechanism or an RDMA function.
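As an illustration only, the region registration, access control key check, and offset-based address translation described above might be modeled in software as follows. This is a minimal sketch, not the disclosed hardware: the names `RdmaRegionTable`, `register`, and `translate`, the 32-bit key width, and the flat integer addresses are all assumptions introduced for the example.

```python
import secrets
from dataclasses import dataclass


@dataclass
class Region:
    base: int       # start address of the registered buffer (hypothetical flat address)
    length: int     # size of the registered buffer in bytes
    key: int        # access control key shared with the peer
    writable: bool  # whether the peer may write into this region


class RdmaRegionTable:
    """Hypothetical software model of a table of registered RDMA regions."""

    def __init__(self):
        self._regions = {}
        self._next_id = 1

    def register(self, base, length, writable=True):
        """Register a buffer; return (region id, key) to advertise to the peer."""
        region_id = self._next_id
        self._next_id += 1
        key = secrets.randbits(32)  # key width is an assumption
        self._regions[region_id] = Region(base, length, key, writable)
        return region_id, key

    def translate(self, region_id, key, offset, nbytes, write):
        """Check key, permission, and bounds, then return the target address."""
        region = self._regions.get(region_id)
        if region is None or region.key != key:
            raise PermissionError("unknown region or bad access key")
        if write and not region.writable:
            raise PermissionError("region is read-only")
        if offset < 0 or nbytes < 0 or offset + nbytes > region.length:
            raise ValueError("access outside the registered region")
        return region.base + offset
```

In this model the host is involved only in `register`; every later transfer is validated by `translate` alone, which mirrors the idea that data movement proceeds without substantial host intervention once regions and keys have been exchanged.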

FIG. 27 illustrates host interface controller 1710 of FIG. 17 in more detail. The host interface block includes a host bus interface controller, block 2709, which provides the physical interface to the host bus. The host interface block may be implemented as a fabric interface or media independent interface when embodied in a switch or a gateway or similar configuration depending on the system architecture and may provide virtual output queuing and/or other quality of service features. The transaction controller portion of block 2708 executes various bus transactions, maintains their status, and takes requested transactions to completion. The host command unit, block 2710, includes host bus configuration registers and one or more command interpreters to execute the commands being delivered by the host. The host driver provides these commands to this processor over Host Output Queue Interface 2703. The commands serve various functions like setting up configuration registers, scheduling DMA transfers, setting up DMA regions and permissions if needed, setting up session entries, retrieving the session database, configuring RDMA engines and the like. The storage and other commands may also be transferred using this interface for execution by the IP processor.

FIG. 28 illustrates the security engine 1705 of FIG. 17 in more detail. The security engine illustrated provides authentication and encryption and decryption services like those required by standards such as IPSEC. The services offered by the security engine may include multiple authentication and security algorithms. The security engine may be on-board the processor or may be part of a separate silicon chip as indicated earlier. An external security engine providing IP security services would be situated in a similar position in the data flow, as one of the first stages of packet processing for incoming packets and as one of the last stages for outgoing packets. The security engine illustrated provides advanced encryption standard (AES) based encryption and decryption services, which are very hardware performance efficient algorithms adopted as security standards. This block could also provide other security capabilities like DES or 3DES, as an example. The supported algorithms and features for security and authentication are driven by the silicon cost and development cost. The algorithms chosen would also be those required by the IP storage standards. The authentication engine, block 2803, is illustrated to include the SHA-1 algorithm as one example of usable algorithms. This block provides message digest and authentication capabilities as specified in the IP security standards. The data flows through these blocks when security and message authentication services are required. The clear packets on their way out to the target are encrypted and are then authenticated, if required, using the appropriate engines. The secure packets received go through the same steps in reverse order. The secure packet is authenticated and then decrypted using the engines 2803, 2804 of this block. The security engine also maintains the security associations in a security context memory, block 2809, that are established for the connections. 
The security associations (which may include the secure session index, security keys, algorithms used, current state of the session and the like) are used to perform the message authentication and the encryption/decryption services. It is possible to use the message authentication service and the encryption/decryption services independently of each other. The security engine of FIG. 28 or the classification/policy engine of FIG. 20 or a combination thereof, along with other protocol processing hardware capabilities of this patent, create a secure TCP/IP stack using the hardware processor of this patent.
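The ordering described above (encrypt, then authenticate on the way out; authenticate, then decrypt on the way in) can be sketched in software as follows. This is illustrative only and rests on several assumptions: the security association is modeled as a plain dictionary with hypothetical `enc_key` and `auth_key` fields, HMAC-SHA-1 stands in for the message authentication algorithm, and a toy hash-based keystream stands in for AES because the Python standard library has no AES implementation. The toy cipher is not secure and is not the disclosed algorithm.

```python
import hashlib
import hmac

MAC_LEN = hashlib.sha1().digest_size  # 20 bytes for SHA-1


def _toy_cipher(key: bytes, data: bytes) -> bytes:
    """Toy XOR keystream standing in for AES (assumption; NOT secure)."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha1(key + counter.to_bytes(4, "big")).digest())
        counter += 1
    # zip() stops at len(data), so any extra keystream bytes are ignored.
    return bytes(a ^ b for a, b in zip(data, stream))


def protect_outgoing(sa: dict, clear_packet: bytes) -> bytes:
    """Outgoing path: encrypt first, then append a message authentication code."""
    cipher_text = _toy_cipher(sa["enc_key"], clear_packet)
    mac = hmac.new(sa["auth_key"], cipher_text, hashlib.sha1).digest()
    return cipher_text + mac


def accept_incoming(sa: dict, secure_packet: bytes) -> bytes:
    """Incoming path: authenticate first, then decrypt (reverse order)."""
    cipher_text, mac = secure_packet[:-MAC_LEN], secure_packet[-MAC_LEN:]
    expected = hmac.new(sa["auth_key"], cipher_text, hashlib.sha1).digest()
    if not hmac.compare_digest(mac, expected):
        raise ValueError("message authentication failed")
    return _toy_cipher(sa["enc_key"], cipher_text)
```

Authenticating before decrypting means a tampered packet is rejected without spending any decryption work on it.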

FIG. 29 illustrates the session cache and memory controller complex seen generally at 1704 of FIG. 17 in more detail. The memory complex includes a cache/memory architecture for the TCP/IP session database called session/global session memory or session cache in this patent, implemented as a cache or memory or a combination thereof. The session cache look-up engine, block 2904, provides the functionality to look up a specific session cache entry. This look-up block creates a hash index out of the fields provided or is able to accept a hash key and looks up the session cache entry. If there is no tag match in the cache array with the hash index, the look-up block uses this key to find the session entry from the external memory and replaces the current session cache entry with that session entry. It provides the session entry fields to the requesting packet processor complex. The cache entries that are present in the local processor complex cache are marked shared in the global cache. Thus when any processor requests this cache entry, it is transferred to the global cache and the requesting processor and marked as such in the global cache. The session memory controller is also responsible for moving the evicted local session cache entries into the global cache inside this block. Thus only the latest session state is available at any time to any requesters for the session entry. If the session cache is full, a new entry may cause the least recently used entry to be evicted to the external memory. The session memory may be a single way or multi-way cache or a hash indexed memory or a combination thereof, depending on the silicon real estate available in a given process technology. The use of a cache for storing the session database entry is unique, in that in networking applications for network switches or routers, generally there is not much locality of reference available between packets, and hence use of a cache may not provide much performance improvement due to cache misses. 
However, the storage transactions are longer duration transactions between the two end systems and may exchange large amounts of data. In this scenario, or in cases where a large amount of data transfer occurs between two nodes, like in clustering or media servers or the like, a cache based session memory architecture will achieve significant performance benefit from reducing the enormous data transfers from the off chip memories. The size of the session cache is a function of the available silicon die area and can have an impact on performance based on the trade-off. The memory controller block also provides services to other blocks that need to store packets, packet fragments or any other operating data in memory. The memory interface provides single or multiple external memory controllers, block 2901, depending on the expected data bandwidth that needs to be supported. This can be a double data rate controller or a controller for DRAM or SRAM or RDRAM or other dynamic or static RAM or a combination thereof. The figure illustrates multiple controllers; however, the number is variable depending on the necessary bandwidth and the costs. The memory complex may also provide timer functionality for use in retransmission time out for sessions that queue themselves on the retransmission queues maintained by the session database memory block.
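The look-up, miss handling, and least-recently-used eviction described above can be sketched as a small software model. This is a simplified illustration under stated assumptions: the class name `SessionCache`, the use of a Python dictionary to stand in for the external memory, and the hit/miss counters are all hypothetical, and the local/global cache sharing protocol is not modeled.

```python
from collections import OrderedDict


class SessionCache:
    """Hypothetical LRU model of an on-chip session cache backed by external memory."""

    def __init__(self, capacity, external_memory):
        self.capacity = capacity
        self.external = external_memory  # dict standing in for off-chip memory
        self.entries = OrderedDict()     # ordered oldest-first
        self.hits = 0
        self.misses = 0

    def install(self, session_key, entry):
        """Add or update a session entry, evicting the LRU entry if full."""
        if session_key in self.entries:
            self.entries.move_to_end(session_key)
        elif len(self.entries) >= self.capacity:
            old_key, old_entry = self.entries.popitem(last=False)
            self.external[old_key] = old_entry  # write back the evicted state
        self.entries[session_key] = entry

    def lookup(self, session_key):
        """Return the session entry, fetching from external memory on a miss."""
        if session_key in self.entries:
            self.hits += 1
            self.entries.move_to_end(session_key)
            return self.entries[session_key]
        self.misses += 1
        entry = self.external.get(session_key)
        if entry is not None:
            self.install(session_key, entry)
        return entry
```

A long-running storage session touches the same key for many consecutive packets, so after the first miss every look-up is a hit; this is the locality argument the passage makes for caching session state.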

FIG. 30 illustrates the data structure details for the classification engine. This is one way of organizing the data structures for the classification engine. The classification database is illustrated as a tree structure, block 3001, with nodes, block 3003, in the tree and actions, block 3008, associated with those nodes, which allow the classification engine to walk down the tree making comparisons for the specific node values. The node values and the fields they represent are programmable. The action field is extracted when a field matches a specific node value. The action item defines the next step, which may include extracting and comparing a new field, performing other operations like ALU operations on specific data fields associated with this node-value pair, or may indicate a terminal node, at which point the classification of the specific packet is complete. This data structure is used by the classification engine to classify the packets that it receives from the packet scheduler. The action items that are retrieved with the value matches, while iterating over different fields of the packet, are used by the results compiler to create a classification tag, which is attached to the packet, generally before the packet headers. The classification tag is then used as a reference by the rest of the processor to decide on the actions that need to be taken based on the classification results. The classifier with its programmable characteristics allows the classification tree structure to be changed in-system and allows the processor to be used in systems that have different classification needs. The classification engine also allows creation of storage/network policies that can be programmed as part of the classification tree-node-value-action structures and provides a very powerful capability in IP based storage systems. The policies would enhance the management of the systems that use this processor and allow enforcement capabilities when certain policies or rules are met or violated. 
The classification engine allows expansion of the classification database through external components, when that is required by the specific system constraints. The number of trees and nodes is decided based on the silicon area and performance tradeoffs. The data structure elements are maintained in various blocks of the classification engine and are used by the classification sequencer to direct the packet classification through the structures. The classification data structures may require more or fewer fields than those indicated depending on the target solution. Thus the core functionality of classification may be achieved with fewer components and structures without departing from the basic architecture. The classification process walks through the trees and the nodes as programmed. A specific node action may cause a new tree to be used for the remaining fields for classification. Thus, the classification process starts at the tree root and progresses through the nodes until it reaches the leaf node.
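To make the tree-node-value-action walk concrete, the following is a minimal software sketch. The dictionary layout, the field names (`ethertype`, `ip_proto`, `dst_port`), the tag strings, and the `classify` function are all assumptions chosen for illustration and do not reflect the actual layout of FIG. 30. Port 3260 is used because it is the registered iSCSI port.

```python
# Each node compares one packet field against programmed values.
# Each action may contribute a tag, continue to a next node, switch
# to another tree, or mark the walk as terminal.
TREES = {
    "root": {
        "field": "ethertype",
        "values": {
            0x0800: {
                "tag": "ipv4",
                "next": {
                    "field": "ip_proto",
                    "values": {6: {"tag": "tcp", "tree": "tcp_ports"}},
                    "default": {"tag": "bypass", "terminal": True},
                },
            },
        },
        "default": {"tag": "reject", "terminal": True},
    },
    "tcp_ports": {
        "field": "dst_port",
        "values": {3260: {"tag": "iscsi", "terminal": True}},
        "default": {"tag": "tcp-other", "terminal": True},
    },
}


def classify(trees, packet, start_tree="root"):
    """Walk the trees from the root and return the list of collected tags."""
    tags = []
    node = trees[start_tree]
    while node is not None:
        value = packet.get(node["field"])
        action = node["values"].get(value, node.get("default"))
        if action is None:
            tags.append("unclassified")
            break
        if "tag" in action:
            tags.append(action["tag"])
        if action.get("terminal"):
            break
        if "tree" in action:
            node = trees[action["tree"]]  # a node action selects a new tree
        else:
            node = action.get("next")     # otherwise descend within this tree
    return tags
```

Because the trees are ordinary data, reprogramming the classifier in this model amounts to replacing entries in `TREES`, which parallels the in-system programmability described above.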

FIG. 31 illustrates a read operation between an initiator and target. The initiator sends a READ command request, block 3101, to the target to start the transaction. This is an application layer request which is mapped to a specific SCSI protocol command, which is then transported as a READ protocol data unit, block 3102, in an IP based storage network. The target prepares the data that is requested, block 3103, and provides read response PDUs, block 3105, segmented to meet the maximum transfer unit limits. The initiator then retrieves the data from the IP packets, block 3106, and stores it in the read buffers allocated for this operation. Once all the data has been transferred, the target responds with command completion and sense status, block 3107. The initiator then retires the command once the full transfer is complete, block 3109. If there were any errors at the target and the command is being aborted for any reason, then a recovery procedure may be initiated separately by the initiator. This transaction is a standard SCSI READ transaction with the data transported over an IP based storage protocol like iSCSI as the PDUs of that protocol.
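The segmentation of read data into response PDUs and its placement into the initiator's read buffer can be sketched as follows. This is an illustration under assumptions: the PDU is modeled as a dictionary with hypothetical `offset`, `payload`, and `final` fields, and real protocol headers, digests, and status PDUs are omitted.

```python
def segment_read_data(data: bytes, max_payload: int):
    """Split read data into response PDUs no larger than max_payload bytes."""
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    pdus = []
    for offset in range(0, len(data), max_payload):
        chunk = data[offset:offset + max_payload]
        pdus.append({
            "offset": offset,                             # position in the read buffer
            "payload": chunk,
            "final": offset + len(chunk) == len(data),    # last PDU of the transfer
        })
    return pdus


def deposit(pdus, total_length: int) -> bytes:
    """Place each PDU payload at its offset in the initiator's read buffer."""
    read_buffer = bytearray(total_length)
    for pdu in pdus:
        start = pdu["offset"]
        read_buffer[start:start + len(pdu["payload"])] = pdu["payload"]
    return bytes(read_buffer)
```

Carrying an explicit offset in each PDU lets the receiver place payloads directly into the destination buffer regardless of arrival order.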

FIG. 32 illustrates the data flow inside the IP processor of this invention for one of the received READ PDUs of the transaction illustrated in FIG. 31. The internal data flow is shown for the read data PDU received by the IP processor on the initiator end. This figure illustrates various stages of operation that a packet goes through. The stages can be considered as pipeline stages through which the packets traverse. The number of pipe stages traversed depends on the type of the packet received. The figure illustrates the pipe stages for a packet received on an established connection. The packet traverses through the following major pipe stages:

1. Receive Pipe Stage of block 3201, with major steps illustrated in block 3207. The packet is received by the media access controller. The packet is detected, the preamble/trailers are removed and a packet is extracted with the Layer 2 header and the payload. This is the stage where the Layer 2 validation occurs for the intended recipient as well as any error detection. There may be quality of service checks applied as per the policies established. Once the packet validation is clear, the packet is queued to the input queue.

2. Security Pipe Stage of block 3202, with major steps illustrated in block 3208. The packet is moved from the input queue to the classification engine, where a quick determination for security processing is made, and if the packet needs to go through security processing, it enters the security pipe stage. If the packet is received in clear text and does not need authentication, then the security pipe stage is skipped. The security pipe stage may also be omitted if the security engine is not integrated with the IP processor. The packet goes through various stages of the security engine, where first the security association for this connection is retrieved from memory, and the packet is authenticated using the message authentication algorithm selected. The packet is then decrypted using the security keys that have been established for the session. Once the packet is in clear text, it is queued back to the input queue controller.

3. Classification Pipe Stage of block 3203, with major steps illustrated in block 3209. The scheduler retrieves the clear packet from the input queue and schedules the packet for classification. The classification engine performs various tasks like extracting the relevant fields from the packet for layer 3 and higher layer classification, identifying TCP/IP/storage protocols and the like, and creating the corresponding classification tags. It may also take actions like rejecting the packet or tagging the packet for bypass depending on the policies programmed in the classification engine. The classification engine may also tag the packet with the session or the flow to which it belongs along with marking the packet header and payload for ease of extraction. Some of the tasks listed may or may not be performed, and other tasks may be performed, depending on the programming of the classification engine. Once the classification is done, the classification tag is added to the packet and the packet is queued for the scheduler to process.

4. Schedule Pipe Stage of block 3204, with major steps illustrated in block 3210. The classified packet is retrieved from the classification engine queue and stored in the scheduler for it to be processed. The scheduler performs the hash of the source and destination fields from the packet header to identify the flow to which the packet belongs, if not done by the classifier. Once the flow identification is done, the packet is assigned to an execution resource queue based on the flow dependency. As the resource becomes available to accept a new packet, the next packet in the queue is assigned for execution to that resource.

5. Execution Pipe Stage of block 3205, with major steps illustrated in block 3211. The packet enters the execution pipe stage when the resource to execute this packet becomes available. The packet is transferred to the packet processor complex that is supposed to execute the packet. The processor looks at the classification tag attached to the packet to decide the processing steps required for the packet. If this is an IP based storage packet, then the session database entry for this session is retrieved. The database access may not be required if the local session cache already holds the session entry. If the packet assignment was done based on the flow, then the session entry may not need to be retrieved from the global session memory. The packet processor then starts the TCP engine/the storage engines to perform their operations. The TCP engine performs various TCP checks including checksum, sequence number checks, framing checks with necessary CRC operations, and TCP state update. Then the storage PDU is extracted and assigned to the storage engine for execution. The storage engine interprets the command in the PDU and in this particular case identifies it to be a read response for an active session. It then verifies the payload integrity and the sequence integrity and then updates the storage flow state in the session database entry. The memory descriptor of the destination buffer is also retrieved from the session database entry and the extracted PDU payload is queued to the storage flow/RDMA controller and the host interface block for them to DMA the data to the final buffer destination. The data may be delivered to the flow controller with the memory descriptor and the command/operation to perform. In this case, the operation is to deposit the data for this active read command. The storage flow controller updates its active command database. The execution engine indicates to the scheduler that the packet has been retired and the packet processor complex is ready to receive its next command.

6. DMA Pipe Stage of block 3206, with major steps illustrated in block 3212. Once the storage flow controller makes the appropriate verification of the memory descriptor, the command and the flow state, it passes the data block to the host DMA engine for transfer to the host memory. The DMA engine may perform priority based queuing, if such a QOS mechanism is programmed or implemented. The data is transferred to the host memory location through DMA. If this is the last operation of the command, then the command execution completion is indicated to the host driver. If this is the last operation for a command and the command has been queued to the completion queue, the resources allocated for the command are released to accept a new command. The command statistics may be collected and transferred with the completion status as may be required for performance analysis, policy management or other network management or statistical purposes.
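The flow identification and queue assignment performed in the schedule pipe stage above can be sketched as follows. This is a simplified model under assumptions: the class `FlowScheduler`, the packet dictionary fields, the use of CRC-32 as the hash, and the modulo mapping of flows to processors are all hypothetical choices for illustration; the text does not specify the hash function or the mapping.

```python
import zlib
from collections import deque


class FlowScheduler:
    """Hypothetical model: hash the flow fields and queue per execution resource."""

    def __init__(self, num_processors: int):
        self.queues = [deque() for _ in range(num_processors)]

    @staticmethod
    def flow_id(packet: dict) -> int:
        """Hash source and destination fields to identify the flow."""
        key = "{src_ip}:{src_port}>{dst_ip}:{dst_port}".format(**packet)
        return zlib.crc32(key.encode("ascii"))

    def enqueue(self, packet: dict) -> int:
        """Assign the packet to the resource queue that owns its flow."""
        index = self.flow_id(packet) % len(self.queues)
        self.queues[index].append(packet)
        return index

    def next_for(self, index: int):
        """Called when a resource becomes available; returns None if idle."""
        queue = self.queues[index]
        return queue.popleft() if queue else None
```

Keeping every packet of a flow on one queue preserves per-flow ordering and lets the assigned packet processor reuse the session entry it already holds locally.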

FIG. 33 illustrates a write command operation between an initiator and a target. The initiator sends a WRITE command, block 3301, to the target to start the transaction. This command is transported as a WRITE PDU, block 3302, on the IP storage network. The receiver queues the received command in the new request queue. Once the old commands in operation are completed, block 3304, the receiver allocates the resources to accept the WRITE data corresponding to the command, block 3305. At this stage the receiver issues a ready to transfer (R2T) PDU, block 3306, to the initiator, with an indication of the amount of data it is willing to receive and from which locations. The initiator interprets the fields of the R2T requests and sends the data packets, block 3307, to the receiver as per the received R2T. This sequence of exchanges between the initiator and target continues until the command is terminated. A successful command completion or an error condition is communicated to the initiator by the target as a response PDU, which then terminates the command. The initiator may be required to start a recovery process in case of an error. This is not shown in the exchange of FIG. 33.
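The target-paced exchange described above can be sketched as a small simulation in which the target requests data one R2T window at a time and the initiator answers each R2T with one or more data PDUs. The function name `run_write`, the event strings, and the `r2t_window` and `max_payload` parameters are assumptions introduced for illustration; error handling and recovery are not modeled.

```python
def run_write(data: bytes, r2t_window: int, max_payload: int):
    """Simulate a WRITE paced by R2T PDUs; return (target buffer, event log)."""
    if r2t_window <= 0 or max_payload <= 0:
        raise ValueError("window and payload sizes must be positive")
    events = ["WRITE command"]
    target_buffer = bytearray(len(data))
    received = 0
    while received < len(data):
        # The target states how much data it will accept and from which offset.
        want = min(r2t_window, len(data) - received)
        events.append(f"R2T offset={received} length={want}")
        # The initiator sends the requested range as one or more data PDUs.
        for offset in range(received, received + want, max_payload):
            end = min(offset + max_payload, received + want)
            target_buffer[offset:end] = data[offset:end]
            events.append(f"DATA offset={offset} length={end - offset}")
        received += want
    events.append("RESPONSE status=good")
    return bytes(target_buffer), events
```

Because the target issues each R2T only for the amount it has buffers for, the receiver never has to accept unsolicited data beyond its allocated resources.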

FIG. 34 illustrates the data flow inside the IP processor of this invention for one of the R2T PDUs and the following write data of the write transaction illustrated in FIG. 33. The initiator receives the R2T packet through its network media interface. The packet passes through all the stages, blocks 3401, 3402, 3403, and 3404 with detailed major steps in corresponding blocks 3415, 3416, 3409 and 3410, similar to the READ PDU in FIG. 32 including Receive, Security, Classification, Schedule, and Execution. Security processing is not illustrated in this figure. Following these stages, the R2T triggers the write data fetch using the DMA stage shown in FIG. 34, blocks 3405 and 3411. The write data is then segmented and put in TCP/IP packets through the execution stage, blocks 3406 and 3412. The TCP and storage session DB entries are updated for the WRITE command with the data transferred in response to the R2T. The packet is then queued to the output queue controller. Depending on the security agreement for the connection, the packet may enter the security pipe stage, blocks 3407 and 3413. Once the packet has been encrypted and message authentication codes generated, the packet is queued to the network media interface for transmission to the destination. During this stage, blocks 3408 and 3414, the packet is encapsulated in the Layer 2 headers, if not already done by the packet processor, and is transmitted. The steps followed in each stage of the pipeline are similar to those of the READ PDU pipe stages above, with additional stages for the write data packet, which are illustrated in this figure. The specific operations performed in each stage depend on the type of the command, the state of the session, the command state and various other configurations for policies that may be set up.

FIG. 35 illustrates the READ data transfer using the RDMA mechanism between an initiator and a target. The initiator and target register the RDMA buffers before initiating the RDMA data transfer, blocks 3501, 3502, and 3503. The initiator issues a READ command, block 3510, with the RDMA buffer as the expected recipient. This command is transported to the target, block 3511. The target prepares the data to be read, block 3504, and then performs the RDMA write operations, block 3505, to directly deposit the read data into the RDMA buffers at the initiator without host intervention. The operation completion is indicated using the command completion response.

FIG. 36 illustrates the internal architecture data flow for the RDMA Write packet implementing the READ command flow. The RDMA write packet also follows the same pipe stages as any other valid data packet that is received on the network interface. This packet goes through Layer 2 processing in the receive pipe stage, blocks 3601 and 3607, from where it is queued for the scheduler to detect the need for security processing. If the packet needs to be decrypted or authenticated, it enters the security pipe stage, blocks 3602 and 3608. The decrypted packet is then scheduled to the classification engine for it to perform the classification tasks that have been programmed, blocks 3603 and 3609. Once classification is completed, the tagged packet enters the schedule pipe stage, blocks 3604 and 3610, where the scheduler assigns this packet to a resource specific queue dependent on flow based scheduling. When the intended resource is ready to execute this packet, it is transferred to that packet processor complex, blocks 3605 and 3611, where all the TCP/IP verification, checks, and state updates are made and the PDU is extracted. Then the storage engine identifies the PDU as belonging to a storage flow for storage PDUs implemented using RDMA and interprets the RDMA command. In this case it is an RDMA write to a specific RDMA buffer. This data is extracted and passed on to the storage flow/RDMA controller block, which performs the RDMA region translation and protection checks, and the packet is queued for DMA through the host interface, blocks 3606 and 3612. Once the packet has completed operation through the packet processor complex, the scheduler is informed and the packet is retired from the states carried in the scheduler. Once in the DMA stage, the RDMA data transfer is completed, and if this is the last data transfer that completes the storage command execution, that command is retired and assigned to the command completion queue.

FIG. 37 illustrates the storage write command execution using RDMA Read operations. The initiator and target first register their RDMA buffers with their RDMA controllers and then also advertise the buffers to their peer. Then the initiator issues a write command, block 3701, to the target, where it is transported using the IP storage PDU. The recipient executes the write command by first allocating the RDMA buffer to receive the write and then requesting an RDMA read to the initiator, blocks 3705 and 3706. The data to be written from the initiator is then provided as an RDMA read response packet, blocks 3707 and 3708. The receiver deposits the packet directly to the RDMA buffer without any host interaction. If the read request was for data larger than the segment size, then multiple READ response PDUs would be sent by the initiator in response to the READ request. Once the data transfer is complete, the completion status is transported to the initiator and the command completion is indicated to the host.

FIG. 38 illustrates the data flow of an RDMA Read request and the resulting write data transfer for one section of the flow transaction illustrated in FIG. 37. The data flow is very similar to the write data flow illustrated in FIG. 34. The RDMA read request packet flows through various processing pipe stages including: receive, classify, schedule, and execution, blocks 3801, 3802, 3803, 3804, 3815, 3816, 3809 and 3810. Once this request is executed, it generates the RDMA read response packet. The RDMA response is generated by first doing the DMA, blocks 3805 and 3811, of the requested data from the system memory, and then creating segments and packets through the execution stage, blocks 3806 and 3812. The appropriate session database entries are updated and the data packets go to the security stage, if necessary, blocks 3807 and 3813. The secure or clear packets are then queued to the transmit stage, blocks 3808 and 3814, which performs the appropriate layer 2 updates and transmits the packet to the target.

FIG. 39 illustrates an initiator command flow for the storage commands initiated from the initiator in more detail. As illustrated, the following are some of the major steps that a command follows:

1. Host driver queues the command in the processor command queue in the storage flow/RDMA controller;

2. Host is informed if the command is successfully scheduled for operation and to reserve the resources;

3. The storage flow/RDMA controller schedules the command for operation to the packet scheduler, if the connection to the target is established. Otherwise the controller initiates the target session initiation and, once the session is established, the command is scheduled to the packet scheduler;

4. The scheduler assigns the command to one of the SAN packet processors that is ready to accept this command;

5. The processor complex sends a request to the session controller for the session entry;

6. The session entry is provided to the packet processor complex;

7. The packet processor forms a packet to carry the command as a PDU and the packet is scheduled to the output queue; and

8. The command PDU is given to the network media interface, which sends it to the target.

This is the high level flow primarily followed by most commands from the initiator to the target when the connection has been established between an initiator and a target.

FIG. 40 illustrates the read packet data flow in more detail. Here the read command is initially sent from the initiator to the target using a flow similar to that illustrated in FIG. 39. The target sends the read response PDU to the initiator, which follows the flow illustrated in FIG. 40. As illustrated, the read data packet passes through the following major steps:

1. Input packet is received from the network media interface block;

2. Packet scheduler retrieves the packet from the input queue;

3. Packet is scheduled for classification;

4. Classified packet returns from the classifier with a classification tag;

5. Based on the classification and flow based resource allocation, the packet is assigned to a packet processor complex which operates on the packet;

6. Packet processor complex looks up the session entry in the session cache (if not present locally);

7. Session cache entry is returned to the packet processor complex;

8. Packet processor complex performs the TCP/IP operations/IP storage operations and extracts the read data in the payload. The read data with appropriate destination tags like MDL (memory descriptor list) is provided to the host interface output controller; and

9. The host DMA engine transfers the read data to the system buffer memory.

Some of these steps are provided in more detail in FIG. 32, where a secure packet flow is represented, whereas FIG. 40 represents a clear text read packet flow. This flow and other flows illustrated in this patent are applicable to storage and non-storage data transfers by using appropriate resources of the disclosed processor, as a person with ordinary skill in the art will be able to do with the teachings of this patent.

FIG. 41 illustrates the write data flow in more detail. The write command follows a flow similar to that in FIG. 39. The initiator sends the write command to the target. The target responds to the initiator with a ready to transfer (R2T) PDU, which indicates to the initiator that the target is ready to receive the specified amount of data. The initiator then sends the requested data to the target. FIG. 41 illustrates the R2T followed by the requested write data packet from the initiator to the target. The major steps followed in this flow are as follows:

1. Input packet is received from the network media interface block;

2. Packet scheduler retrieves the packet from the input queue;

3. Packet is scheduled for classification;

4. Classified packet returns from the classifier with a classification tag;

a. Depending on the classification and flow based resource allocation, the packet is assigned to a packet processor complex which operates on the packet;

5. Packet processor complex looks up the session entry in the session cache (if not present locally);

6. Session cache entry is returned to the packet processor complex;

7. The packet processor identifies the R2T PDU and requests the write data with a request to the storage flow/RDMA controller;

8. The flow controller starts the DMA to the host interface;

9. Host interface performs the DMA and returns the data to the host input queue;

10. The packet processor complex receives the data from the host input queue;

11. The packet processor complex forms a valid PDU and packet around the data, updates the appropriate session entry and transfers the packet to the output queue; and

12. The packet is transferred to the output network media interface block, which transmits the data packet to the destination.

The flow in FIG. 41 illustrates clear text data transfer. If the data transfer needs to be secure, the flow is similar to that illustrated in FIG. 43, where the output data packet is routed through the security engine as illustrated by arrows labeled 11 a and 11 b. The input R2T packet, if secure, would also be routed through the security engine (this is not illustrated in the figure).

FIG. 42 illustrates the read packet flow when the packet is in cipher text or is secure. This flow is illustrated in more detail in FIG. 32 with its associated description earlier. The primary difference between the secure read flow and the clear read flow is that the packet is initially classified as a secure packet by the classifier, and hence is routed to the security engine. These steps are illustrated by arrows labeled 2 a, 2 b, and 2 c. The security engine decrypts the packet and performs the message authentication, and transfers the clear packet to the input queue for further processing as illustrated by the arrow labeled 2 d. The clear packet is then retrieved by the scheduler and provided to the classification engine as illustrated by arrows labeled 2 e and 3 in FIG. 42. The rest of the steps and operations are the same as those in FIG. 40, described above.

FIG. 44 illustrates the RDMA buffer advertisement flow. This flow is illustrated to be very similar to any other storage command flow as illustrated in FIG. 39. The detailed actions taken in the major steps are different depending on the command. For RDMA buffer advertisement and registration, the RDMA region ID is created and recorded along with the address translation mechanism for this region. The RDMA registration also includes the protection key for the access control and may include other fields necessary for RDMA transfer. The steps to create the packet for the command are similar to those of FIG. 39.

FIG. 45 illustrates the RDMA write flow in more detail. The RDMA writes appear like normal read PDUs to the initiator receiving the RDMA write. The RDMA write packet follows the same major flow steps as a read PDU illustrated in FIG. 40. The RDMA transfer involves the RDMA address translation and region access control key checks, and updating the RDMA database entry, besides the other session entries. The major flow steps are the same as for the regular Read response PDU.
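
The registration and access checks described for FIGS. 44 and 45 may be sketched as follows. This is a minimal illustration under stated assumptions, not the processor's implementation: the names `RdmaRegion`, `RdmaDatabase`, `register` and `translate` are hypothetical, and the address translation is assumed to be a simple base-plus-offset mapping.

```python
from dataclasses import dataclass


@dataclass
class RdmaRegion:
    """Hypothetical RDMA database entry recorded at buffer registration."""
    region_id: int
    base_addr: int            # start of the registered host buffer
    length: int               # size of the region in bytes
    protection_key: int       # key a peer must present for access
    bytes_transferred: int = 0


class RdmaDatabase:
    def __init__(self):
        self._regions = {}
        self._next_id = 1

    def register(self, base_addr, length, protection_key):
        """Create and record a region id with its translation data and key."""
        region = RdmaRegion(self._next_id, base_addr, length, protection_key)
        self._regions[region.region_id] = region
        self._next_id += 1
        return region.region_id

    def translate(self, region_id, key, offset, size):
        """Check the access key and bounds, then return the host address."""
        region = self._regions.get(region_id)
        if region is None:
            raise KeyError("unknown RDMA region")
        if key != region.protection_key:
            raise PermissionError("protection key mismatch")
        if offset < 0 or size < 0 or offset + size > region.length:
            raise ValueError("access outside registered region")
        region.bytes_transferred += size   # update the RDMA database entry
        return region.base_addr + offset
```

In this sketch an RDMA write is accepted only when both the key check and the bounds check pass; a real implementation may carry additional fields in the registration, as noted above.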

FIG. 46 illustrates the RDMA Read data flow in more detail. This diagram illustrates the RDMA read request being received by the initiator from the target and the RDMA Read data being written out from the initiator to the target. This flow is very similar to the R2T response followed by the storage write command. In this flow the storage write command is accomplished using RDMA Read. The major steps that the packet follows are primarily the same as the R2T/write data flow illustrated in FIG. 41.

FIG. 47 illustrates the major steps of the session creation flow. This figure illustrates the use of the control plane processor for this slow path operation required at the session initiation between an initiator and a target. This functionality is possible to implement through the packet processor complex. However, it is illustrated here as being implemented using the control plane processor. Both approaches are acceptable. Following are the major steps during session creation:

1. The command is scheduled by the host driver;

2. The host driver is informed that the command is scheduled and any control information required by the host is passed;

3. The storage flow/RDMA controller detects a request to send the command to a target for which a session does not exist, and hence it passes the request to the control plane processor to establish the transport session;

4. Control plane processor sends a TCP SYN packet to the output queue;

5. The SYN packet is transmitted to the network media interface from which it is transmitted to the destination;

6. The destination, after receiving the SYN packet, responds with the SYN-ACK response, which packet is queued in the input queue on receipt from the network media interface;

7. The packet is retrieved by the packet scheduler;

8. The packet is passed to the classification engine;

9. The tagged classified packet is returned to the scheduler;

10. The scheduler, based on the classification, forwards this packet to the control plane processor;

11. The processor then responds with an ACK packet to the output queue;

12. The packet is then transmitted to the end destination thus finishing the session establishment handshake; and

13. Once the session is established, this state is provided to the storage flow controller. The session entry is thus created which is then passed to the session memory controller (this part not illustrated in the figure).

Prior to getting the session in the established state as in step 13, the control plane processor may be required to perform a full login phase of the storage protocol, exchanging parameters and recording them for the specific connection if this is a storage data transfer connection. Only once the login is authenticated and the parameter exchange is complete does the session enter the session establishment state shown in step 13 above.
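
The initiator-side session creation steps, including the optional login phase, may be sketched as a small state machine. This is an illustration only: the class `InitiatorSession`, the state names and the packet label `LOGIN-OK` (standing in for a completed login and parameter exchange) are hypothetical and do not appear in the figures.

```python
from enum import Enum, auto


class State(Enum):
    CLOSED = auto()
    SYN_SENT = auto()
    LOGIN = auto()          # TCP handshake done, storage login pending
    ESTABLISHED = auto()    # session entry may now be created (step 13)


class InitiatorSession:
    def __init__(self, needs_login):
        self.state = State.CLOSED
        self.needs_login = needs_login
        self.output_queue = []   # stands in for the output queue

    def open(self):
        """Steps 4-5: queue a SYN for transmission."""
        self.output_queue.append("SYN")
        self.state = State.SYN_SENT

    def on_packet(self, kind):
        """Steps 10-13: handle a classified packet sent by the scheduler."""
        if self.state is State.SYN_SENT and kind == "SYN-ACK":
            self.output_queue.append("ACK")
            self.state = State.LOGIN if self.needs_login else State.ESTABLISHED
        elif self.state is State.LOGIN and kind == "LOGIN-OK":
            self.state = State.ESTABLISHED
        # Any other packet is ignored in this sketch.
        return self.state
```

A storage connection would be created with `needs_login=True`, so that the established state is reached only after the login completes.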

FIG. 48 illustrates major steps in the session tear down flow. The steps in this flow are very similar to those in FIG. 47. The primary difference between the two flows is that, instead of the SYN, SYN-ACK and ACK packets for session creation, FIN, FIN-ACK and ACK packets are transferred between the initiator and the target. The major steps are otherwise very similar. Another major difference here is that the appropriate session entry is not created but removed from the session cache and the session memory. The operating statistics of the connection are recorded and may be provided to the host driver, although this is not illustrated in the figure.

FIG. 49 illustrates the session creation and session teardown steps from a target perspective. Following are the steps followed for the session creation:

1. The SYN request from the initiator is received on the network media interface;

2. The scheduler retrieves the SYN packet from the input queue;

3. The scheduler sends this packet for classification to the classification engine;

4. The classification engine returns the classified packet with appropriate tags;

5. The scheduler, based on the classification as a SYN packet, transfers this packet to the control plane processor;

6. Control plane processor responds with a SYN-ACK acknowledgement packet. It also requests the host to allocate appropriate buffer space for unsolicited data transfers from the initiator (this part is not illustrated);

7. The SYN-ACK packet is sent to the initiator;

8. The initiator then acknowledges the SYN-ACK packet with an ACK packet, completing the three-way handshake. This packet is received at the network media interface and queued to the input queue after layer 2 processing;

9. The scheduler retrieves this packet;

10. The packet is sent to the classifier;

11. Classified packet is returned to the scheduler and is scheduled to be provided to the control processor to complete the three-way handshake;

12. The controller gets the ACK packet;

13. The control plane processor now has the connection in an established state and it passes this state to the storage flow controller which creates the entry in the session cache; and

14. The host driver is informed of the completed session creation.

The session establishment may also involve the login phase, which is not illustrated in FIG. 49. However, the login phase and the parameter exchange occur before the session enters the fully configured and established state. These data transfers and handshake may primarily be done by the control processor. Once these steps are taken the remaining steps in the flow above may be executed.

FIGS. 50 and 51 illustrate write data flow in a target subsystem. FIG. 50 illustrates an R2T command flow, which is used by the target to inform the initiator that it is ready to accept a data write from the initiator. The initiator then sends the write, which is received at the target, and the internal data flow is illustrated in FIG. 51. The two figures together illustrate one R2T and data write pair. Following are the major steps that are followed as illustrated in FIGS. 50 and 51 together:

1. The target host system, in response to receiving a write request like that illustrated in FIG. 33, prepares the appropriate buffers to accept the write data and informs the storage flow controller when it is ready to send the ready to transfer request to the initiator;

2. The flow controller acknowledges the receipt of the request and the buffer pointers for DMA to the host driver;

3. The flow controller then schedules the R2T command to be executed to the scheduler;

4. The scheduler issues the command to one of the packet processor complexes that is ready to execute this command;

5. The packet processor requests the session entry from the session cache controller;

6. The session entry is returned to the packet processor;

7. The packet processor forms a TCP packet and encapsulates the R2T command and sends it to the output queue;

8. The packet is then sent out to the network media interface which then sends the packet to the initiator. The security engine could be involved if the transfer needed to be a secure transfer;

9. Then as illustrated in FIG. 51, the initiator responds to R2T by sending the write data to the target. The network media interface receives the packet and queues it to the input queue;

10. The packet scheduler retrieves the packet from the input queue;

11. The packet is scheduled to the classification engine;

12. The classification engine provides the classified packet to the scheduler with the classification tag. The flow illustrated is for an unencrypted packet and hence the security engine is not exercised;

13. The scheduler assigns the packet based on the flow based resource assignment queue to a packet processor queue. The packet is then transferred to the packet processor complex when the packet processor is ready to execute this packet;

14. The packet processor requests the session cache entry (if it does not already have it in its local cache);

15. The session entry is returned to the requesting packet processor;

16. The packet processor performs all the TCP/IP functions, updates the session entry and the storage engine extracts the PDU as the write command in response to the previous R2T. It updates the storage session entry and routes the packet to the host output queue for it to be transferred to the host buffer. The packet may be tagged with the memory descriptor or the memory descriptor list that may be used to perform the DMA of this packet into the host allocated destination buffer; and

17. The host interface block performs the DMA, to complete this segment of the Write data command.
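
The flow based resource assignment of step 13 may be sketched as a table that keeps every packet of a flow on the same packet processor queue. This is only one possible arrangement: the class `FlowScheduler`, the flow identifier format, and the choice of the shortest queue for a new flow are assumptions made for the example and are not taken from the figures.

```python
from collections import deque


class FlowScheduler:
    """Keeps packets of one flow on one packet processor queue."""

    def __init__(self, num_processors):
        self.queues = [deque() for _ in range(num_processors)]
        self.flow_table = {}   # flow id -> packet processor index

    def assign(self, flow_id, packet):
        idx = self.flow_table.get(flow_id)
        if idx is None:
            # New flow: pick the processor with the shortest queue
            # (an assumption; other assignment policies are possible).
            idx = min(range(len(self.queues)),
                      key=lambda i: len(self.queues[i]))
            self.flow_table[flow_id] = idx
        self.queues[idx].append(packet)
        return idx
```

Pinning a flow to one processor lets that processor keep the session entry in its local cache, which is the case covered by the parenthetical remark in step 14.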

FIG. 52 illustrates the target read data flow. This flow is very similar to the initiator R2T and write data flow illustrated in FIG. 41. The major steps followed in this flow are as follows:

1. Input packet is received from the network media interface block;

2. Packet scheduler retrieves the packet from the input queue;

3. Packet is scheduled for classification;

4. Classified packet returns from the classifier with a classification tag;

a. Depending on the classification and flow based resource allocation, the packet is assigned to a packet processor complex which operates on the packet;

5. Packet processor complex looks up the session entry in the session cache (if not present locally);

6. Session cache entry is returned to the packet processor complex;

7. The packet processor determines the Read Command PDU and requests the read data with a request to the flow controller;

8. The flow controller starts the DMA to the host interface;

9. Host interface performs the DMA and returns the data to the host input queue;

10. The packet processor complex receives the data from the host input queue;

11. The packet processor complex forms a valid PDU and packet around the data, updates the appropriate session entry and transfers the packet to the output queue; and

12. The packet is transferred to the output network media interface block which transmits the data packet to the destination.

The discussion above of the flows is an illustration of some of the major flows involved in high bandwidth data transfers. Several flows, such as the fragmented data flow, error flows with multiple different types of errors, the name resolution service flow, address resolution flows, login and logout flows, and the like, are not illustrated but are supported by the IP processor of this invention.

As discussed in the description above, the perimeter security model is not sufficient to protect an enterprise network from security threats due to the blurring boundary of enterprise networks. Further, a significant number of unauthorized information accesses occur from inside. The perimeter security methods do not prevent such security attacks. Thus it is critical to have security deployed across the network and protect the network from within as well as at the perimeter. The network line rates inside enterprise networks are going to 1 Gbps, multi-Gbps and 10 Gbps in the LANs and SANs. As previously mentioned, distributed firewall and security methods require a significant processing overhead on each system host CPU if implemented in software. This overhead can increase the latency of the response of the servers, reduce their overall throughput and leave fewer processing cycles for applications. An efficient hardware implementation that can enable deployment of software driven security services is required to address the issues outlined above. The processor of this patent addresses some of these key issues. Further, at high line rates it is critical to offload the software based TCP/IP protocol processing from the host CPU to protocol processing hardware to reduce impact on the host CPU. Thus, the protocol processing hardware should provide the means to perform the security functions like firewall, encryption, decryption, VPN and the like. The processor provides such a hardware architecture that can address the growing need of distributed security and high network line rates within enterprise networks.

FIG. 53 illustrates a traditional enterprise network with a perimeter firewall. This figure illustrates local area networks and storage area networks inside enterprise networks. The figure illustrates a set of clients, 5301(1) through 5301(n), connected to an enterprise network using wireless LAN. There may be multiple clients of different types like handheld computers, PCs, thin clients, laptops, notebook computers, tablet PCs and the like. Further, they may connect to the enterprise LAN using wireless LAN access points (WAP), 5303. There may be one or more WAPs connected to the LAN. Similarly, the figure also illustrates multiple clients connected to the enterprise LAN through a wired network. These clients may be on different sub segments or the same segment or be directly linked to the switches in a point to point connection, depending on the size of the network, the line rates and the like. The network may have multiple switches and routers that provide the internal connectivity for the network of devices. The figure also illustrates network attached storage devices, 5311, providing network file serving and storage services to the clients. The figure also illustrates one or more servers, 5307(1) through 5307(n) and 5308(1) through 5308(n), attached to the network providing various application services being hosted on these servers to the clients inside the network as well as those being accessed through the outside as web access or other network access. The servers in the server farm may be connected in a traditional three-tier or n-tier network providing different services like web server, application servers, database servers, and the like. These servers may hold direct attached storage devices for the needed storage and/or connect to a storage area network (SAN), using SAN connectivity and switches, 5309(1) through 5309(n), to connect to the storage systems, 5310(1) through 5310(n), for their storage needs.
The storage area network may also be attached to the LAN using gateway devices, 5313, to provide access to the storage systems to the LAN clients. The storage systems may also be connected to the LAN directly, similar to NAS, 5311, to provide block storage services using protocols like iSCSI and the like. This is not illustrated in the figure. The network illustrated in this figure is secured from the external network by the perimeter firewall, 5306. As illustrated in this figure, the internal network in such an environment does not enable security, which poses serious security vulnerabilities to insider attacks.

FIG. 54 illustrates an enterprise network with a distributed firewall and security capabilities. The network configuration illustrated is similar to that in FIG. 53. The distributed security features shown in such a network may be configured, monitored, managed, enabled and updated from a set of central network management systems by central IT manager(s), 5412. The manager(s) is(are) able to set the distributed security policy from management station(s), distribute appropriate policy rules to each node enabled to implement the distributed security policy and monitor any violations or reports from the distributed security processors using the processor of this patent. The network may be a network that comprises one or more nodes, one or more management stations or a combination thereof. The figure illustrates that the SAN devices are not under the distributed security network. The SAN devices in this figure may be under a separate security domain or may be trusted to be protected from insiders and outsiders with the security at the edge of the SAN.

FIG. 55 illustrates an enterprise network with a distributed firewall and security capabilities where the SAN devices are also under a distributed security domain. The rest of the network configuration may be similar to that in FIG. 54. In this scenario, the SAN devices may implement similar security policies as the rest of the network devices and may be under the control of the same IT management systems. The SAN security may be implemented differently from the rest of the network, depending on the security needs, sensitivity of the information and potential security risks. For instance, the SAN devices may implement full encryption/decryption services besides firewall security capabilities to ensure that no unauthorized access occurs and that the data put out on the SAN is always in a confidential mode. These policies and rules may be distributed from the same network management systems or there may be special SAN management systems, not shown, that may be used to create such distributed secure SANs. The systems in FIG. 54 and FIG. 55 use the processor and the distributed security system of this patent.

FIG. 56 illustrates a central manager/policy server and monitoring station, also called the central manager. The central manager includes a security policy developer interface, block 5609, which is used by the IT manager(s) to enter the security policies of the organization. The security policy developer interface may be a command line interface, a scripting tool, a graphical interface or a combination thereof which may enable the IT manager to enter the security policies in a security policy description language. It may also provide access to the IT manager remotely under a secure communication connection. The security policy developer interface works with a set of rule modules that enables the IT manager to enter the organization's policies efficiently. The rule modules may provide rule templates that may be filled in by the IT managers or may be interactive tools that ease the entry of the rules. These modules provide the rules based on the capabilities that are supported by the distributed security system. Networking layers 2 through 4 (L2, L3, L4) rules, rule types, templates, and the like are provided by block 5601 to the security developer interface. These rules may comprise IP addresses for source, destination, L2 addresses for source, destination, L2 payload type, buffer overrun conditions, type of service, priority of the connection, link usage statistics and the like or a combination thereof. The protocol/port level rules, block 5602, provide rules, rule types, templates and the like to the security developer interface. These rules may comprise protocol type like IP, TCP, UDP, ICMP, IPSEC, ARP, RARP or the like, or source port, or destination port including well-known ports for known upper level applications/protocols, or a combination thereof. The block 5603 provides application level or upper layer (L5 through L7) rules, rule types, templates and the like to the security developer interface.
These rules may comprise rules that are dependent on a type of upper layer application or protocol like HTTP, XML, NFS, CIFS, iSCSI, iFCP, FCIP, SSL, RDMA or the like, their usage model, their vulnerabilities or a combination thereof. The content based rules, block 5604, provide rules, rule types, templates, or the like to the security developer interface for entering content dependent rules. These rules may evolve over time, like the other rules, to cover known threats or potential new threats and comprise a wide variety of conditions like social security numbers, confidential/proprietary documents, employee records, patient records, credit card numbers, offending URLs, known virus signatures, buffer overrun conditions, long web addresses, offending language, obscenities, spam, or the like or a combination thereof. These rules, templates or the rule types may be provided for ease of creation of rules in the chosen policy description language(s) for the manager of the distributed security system. The security policy developer interface may exist without the rules modules and continue to provide means to the IT managers to enter the security policies in the system. The rules represented in the security policy language entered through the interface would then get compiled by the security rules compiler, block 5611, for distribution to the network nodes. The security rules compiler utilizes a network connectivity database, 5605, and a nodes capabilities and characteristics database, 5606, to generate rules specific for each node in the network that is part of monitoring/enforcing the security policy. The network connectivity database comprises physical adjacency information, or physical layer connectivity, or link layer connectivity, or network layer connectivity, or OSI layer two addresses or OSI layer three addresses or routing information or a combination thereof.
The nodes capabilities and characteristics database comprises hardware security features or software security features or size of the rules engine or performance of the security engine(s) or quality of service features or host operating system or hosted application(s) or line rates of the network connectivity or host performance or a combination thereof. The information from these databases would enable the security rules compiler to properly map security policies to node specific rules. The node specific rules and general global rules are stored to and retrieved from the rules database, 5607. The security rules compiler then works with the rules distribution engine, 5608, to distribute the compiled rules to each node. The rules distribution engine interacts with each security node of the distributed security system to send the rule set to be used at that specific node. The rule distribution engine may retrieve the rule sets directly from the rules database or work with the security rules compiler or a combination thereof to retrieve the rules. Once the rules are proliferated to the respective nodes, the central manager starts monitoring and managing the network.
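
The mapping of security policies to node specific rules may be sketched as a filter over the nodes capabilities and characteristics database. The sketch below is a simplified illustration, not the compiler of block 5611: the function `compile_rules`, the dictionary layout, the capability labels and the `max_rules` field (standing in for the size of the rules engine) are all assumptions, and the network connectivity database is omitted.

```python
def compile_rules(policy_rules, node_db):
    """Map global policy rules to per-node rule sets (hypothetical sketch).

    policy_rules: list of dicts with a "name" and a set "needs" naming the
                  capabilities required to enforce the rule.
    node_db:      dict of node name -> {"capabilities": set, "max_rules": int}.
    """
    compiled = {}
    for node, info in node_db.items():
        # Keep only the rules this node's hardware/software can enforce.
        rules = [r for r in policy_rules
                 if r["needs"] <= info["capabilities"]]
        # The rule set must fit the node's rules engine.
        if len(rules) > info["max_rules"]:
            raise ValueError(f"rule set does not fit node {node}")
        compiled[node] = rules
    return compiled


policy = [
    {"name": "block-telnet", "needs": {"l4"}},
    {"name": "scan-ssn", "needs": {"l7_content"}},
]
nodes = {
    "server1": {"capabilities": {"l4", "l7_content"}, "max_rules": 8},
    "client1": {"capabilities": {"l4"}, "max_rules": 8},
}
per_node = compile_rules(policy, nodes)
```

Here `server1` receives both rules, while `client1`, which lacks content inspection in this example, receives only the layer 4 rule.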

The central manager works with each node in the security network to collect events or reports of enforcement, statistics, violations and the like using the event and report collection/management engine, 5616. The event/report collection engine works with the security monitoring engine, 5613, to create the event and information report databases, 5614 and 5615, which keep a persistent record of the collected information. The security monitoring engine analyzes the reports and events to check for any violations and may in turn inform the IT managers about the same. Depending on the actions to be taken when violations occur, the security monitoring engine may create policy or rule updates that may be redistributed to the nodes. The security monitoring engine works with the security policy manager interface, 5612, and policy update engine, 5610, for getting the updates created and redistributed. The security policy manager interface provides tools to the IT manager to do event and information record searches. The IT manager may be able to develop new rules or security policy updates based on the monitored events or other searches or changes in the organization's policies and create the updates to the policies. These updates get compiled by the security policy compiler and redistributed to the network. The functionality of security policy manager interface, 5612, and policy update engine, 5610, may be provided by the security policy developer interface, 5609, based on an implementation choice. Such regrouping of functionality and functional blocks is possible without diverging from the teachings of this patent.
The security monitoring engine, the security policy manager interface and the event/report collection/management interface may also be used to manage specific nodes when there are violations that need to be addressed or any other actions need to be taken like enabling a node for security, disabling a node, changing the role of a node, changing the configuration of a node, starting/stopping/deploying applications on a node, or provisioning additional capacity or other management functions or a combination thereof as appropriate for the central manager to effectively manage the network of the nodes.

FIG. 57 illustrates the central manager flow of this patent. The central manager may comprise various process steps illustrated by the blocks of the flow. The IT manager(s) create and enter the security policies of the organization in central management system(s) as illustrated by block 5701. The policies are then compiled into rules, by the security policy compiler, using a network connectivity database and a node capabilities and characteristics database as illustrated by block 5702. The central manager then identifies the nodes from the network that have security capability enabled, from the node characteristics database, in block 5703, to distribute rules to these nodes. The manager may then select a node from these nodes, as illustrated by block 5704, and retrieve the corresponding security rules from the rules database, as illustrated by block 5705, and then communicate the rules to the node, as illustrated by 5706, and further illustrated by FIG. 58. The central manager continues the process of retrieving the rules and communicating the rules until all nodes have been processed as illustrated by the comparison of all nodes done in block 5707. Once rules have been distributed to all the nodes, the central manager goes into managing and monitoring the network for policy enforcements, violations or other management tasks as illustrated by block 5708. If there are any policy updates that result from the monitoring, the central manager exits the monitoring to create and update new policy through checks illustrated by blocks 5709 and 5710. If there are new policy updates, the central manager traverses through the flow of FIG. 57 to compile the rules and redistribute them to the affected nodes and then continues to monitor the network. The event collection engine of the central manager continues to monitor and log events and information reports while other modules are processing the updates to the security policies and rules.
Thus the network is continuously monitored while the rule updates and distribution are in progress. Once the rule updates are done, the security monitoring engine and other engines process the collected reports. Communication of rules to the nodes and monitoring/managing of the nodes may be done in parallel to improve the performance as well as effectiveness of the security system. The central manager may communicate new rules or updates to multiple nodes in parallel instead of using a serial flow, and assign the nodes that have already received the rules into monitoring/managing state for the central manager. Similarly the policy creation or updates can also be performed in parallel to the rule compilation, distribution and monitoring.
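
The serial form of the FIG. 57 flow may be sketched as a loop. The function `central_manager` and its callback parameters are hypothetical; the compile, send and monitor steps are passed in as plain functions so that the sketch stays self-contained. Unlike the flow described above, which monitors indefinitely, this sketch returns once monitoring reports no further policy update.

```python
def central_manager(policies, node_db, compile_fn, send_fn, monitor_fn):
    """Run a serial version of the FIG. 57 loop (hypothetical sketch).

    compile_fn(policies, node_db) -> dict of node -> rules   (block 5702)
    send_fn(node, rules)          -> None                    (block 5706)
    monitor_fn()                  -> new policies, or None   (blocks 5708-5710)
    Returns the number of compile/distribute rounds performed.
    """
    rounds = 0
    while policies is not None:
        rules_db = compile_fn(policies, node_db)              # block 5702
        secure_nodes = [n for n, info in node_db.items()      # block 5703
                        if info.get("security_enabled")]
        for node in secure_nodes:                             # blocks 5704, 5707
            send_fn(node, rules_db[node])                     # blocks 5705, 5706
        rounds += 1
        policies = monitor_fn()                               # blocks 5708-5710
    return rounds
```

Only nodes marked as security enabled in the node database receive rules, matching the selection made in block 5703.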

FIG. 58 illustrates the rule distribution flow of this patent. The rule distribution engine, working with the security policy compiler, retrieves the rules or rule set to be communicated to a specific node as illustrated by 5801. It then initiates communication with the selected node as illustrated by 5802. The central manager and the node may authenticate each other using an agreed upon method or protocol as illustrated by 5803. Authentication may involve a complete login process, or secure encrypted session or a clear mode session or a combination thereof. Once the node and the central manager authenticate each other, the communication is established between the central manager and the control plane processor or host based policy driver of the node as illustrated by 5804. Once the communication is established, the rule distribution engine sends the rules or rule set or updated rules or a combination thereof to the node as illustrated in 5805. This exchange of the rules may be over a secure/encrypted session or clear link dependent on the policy of the organization. The protocol deployed to communicate the rules may be a well known protocol or a proprietary protocol. Once the rule set has been sent to the node, the central manager may wait to receive the acknowledgement from the node of successful insertion of the new rules at the node as illustrated by 5806. Once a successful acknowledgement is received the rule distribution flow for one node concludes as illustrated by 5807. The appropriate rule database entries for the node would be marked with the distribution completion status. The flow of FIG. 58 is repeated for all nodes that need to receive the rules from the rule distribution engine of the central manager. The rule distribution engine may also be able to distribute rules in parallel to multiple nodes to improve the efficiency of the rule distribution process.
In this scenario the rule distribution engine may perform various steps of the flow like authenticate a node, establish communication with a node, send rule or rules to a node and the like in parallel for multiple nodes.
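
Parallel rule distribution may be sketched with a thread pool, with one pass of the FIG. 58 flow per node. The functions `distribute_to_node` and `distribute_all`, the status strings, and the `transport` object with its `authenticate`, `send` and `wait_ack` methods are hypothetical stand-ins for whatever protocol an implementation deploys.

```python
from concurrent.futures import ThreadPoolExecutor


def distribute_to_node(node, rules, transport):
    """One pass of the FIG. 58 flow for a single node."""
    if not transport.authenticate(node):          # 5803
        return (node, "auth-failed")
    transport.send(node, rules)                   # 5805
    acked = transport.wait_ack(node)              # 5806
    return (node, "complete" if acked else "no-ack")


def distribute_all(rule_sets, transport, workers=4):
    """Distribute to several nodes in parallel; return status per node."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(
            lambda item: distribute_to_node(item[0], item[1], transport),
            rule_sets.items(),
        )
        return dict(results)
```

The returned dictionary plays the role of the distribution completion status that would be marked in the rule database entries for each node.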

FIG. 59 illustrates a control plane processor or a host based policy driver flow of this patent. This flow is executed on each node following the distributed security of this patent, comprising a hardware processor. Upon initiation of policy rule distribution by the central manager or upon reset or power up or other management event or a combination thereof, the policy driver establishes communication with the central manager/policy server as illustrated by 5901. The policy driver receives the rule set or updates to existing rules from the central manager as illustrated by 5902. If the rules are formatted to be inserted into the specific policy engine implementation, size and the like, the rules are accepted to be configured in the policy engine. If the rules are always properly formatted by the central manager it is feasible to avoid performing the check illustrated in block 5903. Otherwise, if the rules are not always formatted or otherwise ready to be directly inserted in the policy engine, as determined in block 5903, the driver configures the rules for the node as illustrated by block 5904. The driver then communicates with the database initialization and management interface, block 2011 of FIG. 20, of the policy engine of the processor. This is illustrated by block 5905. Then the driver sends a rule to the policy engine which updates it in the engine data structures, like that in FIG. 30, which comprise a ternary or binary CAM, associated memory, ALU, database description and other elements in the classification/policy engine of FIG. 20. This is illustrated by block 5906. This process continues until all the rules have been entered in the policy engine through the decision process illustrated by 5907, 5908 and 5906. Once all rules have been entered, the policy engine activates the new rules working with the driver as illustrated by block 5909. The driver then updates/sends the rules to a persistent storage for future reference and/or retrieval as illustrated by block 5910.
The driver then communicates to the central manager/policy server the update completion and new rules activation in the node as illustrated by block 5911. The policy driver may then enter a mode of communicating the management information, events and reports to the central manager. This part of the driver is not illustrated in the figure. The management functionality may be taken up by a secure process on the host or the control plane processor of the node. The mechanisms described above allow a secure operating environment to be created for the protocol stack processing, where even if the host system gets compromised either through a virus or malicious attack, the network security and integrity can be maintained since a control plane processor based policy driver does not allow the host system to influence the policies or the rules. The rules that are active in the policy engine would prevent a virus or intruder from using this system or node for further virus proliferation or for attacking other systems in the network. The rules may also prevent the attacker from extracting any valuable information from the system like credit card numbers, social security numbers, medical records or the like. This mechanism significantly adds to the trusted computing environment needs of the next generation computing systems. Some or all portions of the flow may be performed in parallel, and some portions may be combined together. For instance, one or more rules may be communicated together by the policy driver to the database initialization/management interface, block 2011, which may then update the rules in the policy engine in an atomic fashion instead of doing it one rule at a time. Further, while new rules are being received by the policy driver or the policy engine or a combination thereof, the hardware processor may continue to perform rule enforcement and analysis with the active rule set in parallel on the incoming or outgoing network traffic.
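
One way to obtain the behavior described above, where enforcement continues on the active rule set while new rules are loaded and the new set then takes effect atomically, is to stage the incoming rules separately and swap them in with a single assignment. The sketch below assumes this staging approach; the class `PolicyEngine` and its methods are hypothetical and do not describe the CAM based data structures of FIG. 30.

```python
class PolicyEngine:
    """Hypothetical rule store with a staged (inactive) and an active set."""

    def __init__(self):
        self.active = ()     # rule set currently used for enforcement
        self._staged = []    # rules being loaded, not yet enforced

    def stage(self, rule):
        """Block 5906: enter one rule without affecting enforcement."""
        self._staged.append(rule)

    def activate(self):
        """Block 5909: switch to the new rule set in one step."""
        self.active = tuple(self._staged)   # single reference swap
        self._staged = []


def load_rules(engine, rules):
    """Blocks 5906-5909: enter all rules, then activate them together."""
    for rule in rules:
        engine.stage(rule)
    engine.activate()
```

Until `activate` is called, lookups against `engine.active` still see the previous rule set, so a partially loaded update is never enforced.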

FIG. 60 illustrates rules that may be deployed in a distributed security system using this patent. The IT manager(s) may decide the policies that need to be deployed for different types of accesses. These policies are converted into rules at the central management system, 5512 or 5412, for distribution to each node in the network that implements one or more security capabilities. The rules are then provided to the processor on the related node. A control plane processor, 1711 of FIG. 17, working with classification and policy engine, 1703, and the DB Initialization/management control interface, 2011 of FIG. 20, of the processor configure the rule in the processor. Each node implementing the distributed security system may have unique rules that need to be applied on the network traffic passing through, originating or terminating at the node. The central management system interacts with all the appropriate nodes and provides each node with its relevant rules. The central management system also interacts with the control plane processor which works with the classification/policy engine of the node to retrieve rule enforcement information and other management information from the node for distributed security system.

FIG. 60 illustrates rules that may be applicable to one or more nodes in the network. The rules may contain more or fewer fields than indicated in the figure. In this illustration, the rules comprise the direction of the network traffic to which the rule is applicable, either In or Out; the source and destination addresses, which may belong to an internal network node address or address belonging to a node external to the network; protocol type of the packet, e.g. TCP, UDP, ICMP and the like as well as source port and destination ports and any other deep packet fields comprising URL information, sensitive information like credit card numbers or social security numbers, or any other protected information like user names, passwords and the like. The rule then contains an action field that indicates the action that needs to be taken when a certain rule is matched. The action may comprise various types like permit the access, deny the access, drop the packet, close the connection, log the request, send an alert or combination of these or more actions as may be appropriate to the rule matched. The rules may be applied in a priority fashion from top to bottom or any other order as may be implemented in the system. The last rule indicates a condition when none of the other rules match and, as illustrated in this example, access is denied.
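
The top-to-bottom, first-match evaluation with a final deny rule described above may be sketched as follows. This is an illustrative sketch only: the Rule class, its field names and the evaluate function are assumptions introduced here, deep packet fields are omitted, and a field left as None is treated as a wildcard. The actual fields and values of FIG. 60 are not reproduced.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    """Hypothetical rule record; a field set to None matches any value."""
    direction: Optional[str] = None    # "In" or "Out"
    src: Optional[str] = None          # source address
    dst: Optional[str] = None          # destination address
    protocol: Optional[str] = None     # e.g. "TCP", "UDP", "ICMP"
    dst_port: Optional[int] = None
    action: str = "deny"               # e.g. "permit", "deny", "drop", "log"

    def matches(self, pkt: dict) -> bool:
        # Every non-wildcard field must equal the corresponding packet field.
        for name in ("direction", "src", "dst", "protocol", "dst_port"):
            want = getattr(self, name)
            if want is not None and pkt.get(name) != want:
                return False
        return True


def evaluate(rules, pkt):
    """Apply rules in priority order from top to bottom; first match wins."""
    for rule in rules:
        if rule.matches(pkt):
            return rule.action
    # Assumption: deny if the table lacks an explicit catch-all last rule.
    return "deny"
```

For example, a table holding Rule(direction="Out", protocol="TCP", action="permit") followed by the all-wildcard Rule(action="deny") permits outbound TCP packets and denies everything else, the last entry playing the role of the final rule of the figure.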

FIG. 61 illustrates TCP/IP processor version of the IP processor illustrated in FIG. 16 and FIG. 17. This processor consists of a network interface block 6101, which is used to connect this processor to the network. The network interface may be a wired or wireless Ethernet interface, Packet over Sonet interface, Media Independent Interface (MII), GMII, XGMII, XAUI, System Packet Interface, SPI 4 or SPI 5 or other SPI derivatives or other network protocol interface or a combination thereof. This is the interface used to send or receive packets to or from the network to which this processor is coupled. Intelligent flow controller and packet buffer block 6103, provides packet scheduler functionality of block 1702 of FIG. 17 as well as the input and output queue controller functionality of block 1701 and 1712. Programmable classification/Rule Engine/Security Processing block 6102, provides the classification and policy/rule processing functionality of block 1703 as well as the security processing functionality of the block 1705 when security capabilities are supported by the specific implementation.

TCP/IP packet processor engines of block 6104, are similar to the TCP/IP processor engine of SAN packet processor of blocks 1706(a) through 1706(n). The connection (session) memory block 6105, provides the functionality of IP session cache/memory of block 1704, whereas the connection manager and control plane processor of block 6106 provide the session controller and control plane processor functionality similar to that of blocks 1704 and 1711. The RDMA controller block 6107, provides RDMA functionality similar to the block 1708. The memory controller block 6109, provides memory interface similar to that provided by memory controller of block 1704. The TCP/IP processor may have external memory which may be SRAM, DRAM, FLASH, ROM, EEPROM, DDR SDRAM, RDRAM, FCRAM, QDR SRAM, Magnetic RAM or Magnetic memory or other derivatives of static or dynamic random access memory or a combination thereof. Host/Fabric/Network Interface block 6108 provides the interface to a host bus or a switch fabric interface or a network interface depending on the system in which this processor is being incorporated. For example in a server or server adapter environment the block 6108 would provide a host bus interface functionality similar to that of block 1710, where the host bus may be a PCI bus, PCI-X, PCI-Express, or other PCI derivatives or other host buses like AMBA bus, or RapidIO bus or HyperTransport or other derivatives. A switch or a router or a gateway or an appliance with a switch fabric to connect multiple line cards would have appropriate fabric interface functionality for block 6108. This may include queues with priority mechanisms to avoid head of the line blocking, fragmentation and defragmentation circuitry as needed by the switch fabric, and appropriate flow control mechanism to ensure equitable usage of the switch fabric resources. 
In case of an environment like a gateway or appliance that connects to a network on ingress and egress, the block 6108 would provide network interface functionality similar to the block 6101.

The TCP/IP processor illustrated in FIG. 61 is a version of the architecture shown in FIG. 16 and FIG. 17 as is evident from the description above. The TCP/IP processor engines of block 6104 may be substituted with SAN packet processors of block 1706(a) through 1706(n) and the two architectures would offer the same functionality. Thus FIG. 61 can be looked at as a different view and/or grouping of the architecture illustrated in FIG. 16 and FIG. 17. The TCP/IP processor engines may be augmented by the packet engine block of the SAN packet processors to provide programmable processing where additional services can be deployed besides protocol processing on a packet by packet basis. Block 6110 of FIG. 61, illustrated as a dotted line around a group of blocks, is called “TCP/IP processor core” in this patent. The RDMA block 6107 is shown to be part of the TCP/IP Processor core although it is an optional block in certain TCP/IP processor core embodiments like low line speed applications or applications that do not support RDMA. Similarly the security engine may also not be present depending on the implementation chosen and the system embodiment.

FIG. 62 illustrates an Adaptable TCP/IP processor of this patent. This processor comprises a network interface block 6201, host/fabric/network interface block 6207, a TCP/IP processor core block 6202, a runtime adaptable processor (RAP) block 6206, or a combination thereof. The adaptable TCP/IP processor may also include an adaptation controller block 6203, configuration memory block 6204, a memory interface block 6205, data buffers block 6209, a memory controller block 6208, RAP interface block 6210, RAP Extension interface block 6211, or a combination thereof. The TCP/IP processor core, block 6202, is the TCP/IP processor core illustrated in FIG. 61 block 6110. As discussed earlier the security and RDMA blocks of the TCP/IP processor core may or may not be present depending on the application and system environment. The TCP/IP processor core provides full TCP/IP protocol processing, protocol termination and protocol initiation functionality. The TCP/IP processor core may provide TCP/IP protocol stack comprising at least one of the following hardware implemented functions:

a. sending and receiving data, including upper layer data;

b. establishing transport sessions and session teardown functions;

c. executing error handling functions;

d. executing time-outs;

e. executing retransmissions;

f. executing segmenting and sequencing operations;

g. maintaining protocol information regarding active transport sessions;

h. maintaining TCP/IP state information for each of one or more session connections;

i. fragmenting and defragmenting data packets;

j. routing and forwarding data and control information;

k. sending to and receiving from a peer, memory regions reserved for RDMA;

l. recording said memory regions reserved for RDMA in an RDMA database and maintaining said database;

m. executing operations provided by RDMA capability;

n. executing security management functions;

o. executing policy management and enforcement functions;

p. executing virtualization functions;

q. communicating errors;

r. processing Layer 2 media access functions to receive and transmit data packets, validate the packets, handle errors, communicate errors and other Layer 2 functions;

s. processing physical layer interface functions;

t. executing TCP/IP checksum generation and verification functions;

u. processing Out of Order packets;

v. CRC calculation functions;

w. processing Direct Data Placement/Transfer;

x. Upper Layer Framing functions;

y. processing functions and interface to socket API's;

z. forming packet headers for TCP/IP for transmitted data and extraction of payload from received packets; and

aa. processing header formation and payload extraction for Layer 2 protocols of data to be transmitted and received data packets, respectively.

The TCP/IP processor core may provide a transport layer RDMA capability as described earlier. The TCP/IP processor core may also provide security functions like network layer security, transport layer security, socket layer security, application layer security or a combination thereof besides wire speed encryption and decryption capabilities. Thus the TCP/IP processor core may also provide a secure TCP/IP stack in hardware with several functions described above implemented in hardware. Even though the description of the adaptable TCP/IP processor has been with the TCP/IP processor core as illustrated in this application, the TCP/IP processor core may have various other architectures. Besides the architecture disclosed in this patent, the TCP/IP processor core could also be a fixed function implementation, or may be implemented as a hardware state machine or may support partial protocol offloading capability, for example support fast path processing in hardware where control plane processing as well as session management and control may reside in a separate control plane processor or host processor or a combination of various architecture alternatives described above. The TCP/IP processor core architecture chosen may also include functions for security or RDMA or a combination thereof. Further, the adaptable TCP/IP processor architecture can be used for protocols other than TCP/IP like SCTP, UDP or other transport layer protocols by substituting the TCP/IP processor core with a protocol appropriate processor core. This would enable creating an adaptable protocol processor targeted to the specific protocol of interest. The runtime adaptable processor of such a processor would be able to function similarly to the description in this patent and offer hardware acceleration for similar applications/services by using its dynamic adaptation capabilities.

The runtime adaptable processor, block 6206, provides a dynamically changeable hardware where logic and interconnect resources can be adapted programmatically on the fly to create virtual hardware implementations as appropriate to the need of the application/service. The adaptation controller, block 6203, may be used to dynamically update the RAP block. The adaptation controller may interface with the host processor or control plane processor or the TCP/IP processor or a combination thereof to decide when to switch the configuration of RAP block to create a new avatar or incarnation to support needed hardware function(s), what function(s) should this avatar of RAP block perform, where to fetch the new avatar, how long is the avatar valid, when to change the avatar, as well as provide multiple simultaneous function support in the RAP block. The RAP block may be dynamically switched from one avatar to another avatar, depending on the analysis done in TCP/IP processor core. For instance, the TCP/IP processor core may have a programmed policy that will ask it to flag any data payload received that may contain XML data and pass the extracted data payload for processing through the RAP instead of sending it directly to the host processor. In this instance, when a packet is received that contains XML data, the TCP/IP processor core may tag the data appropriately and either queue the packets in the external memory for further processing by RAP or pass the data in the data buffers of block 6209 for further processing by RAP. The TCP/IP processor core may be coupled to a RAP interface, block 6210, which may provide the functionality needed for the TCP/IP processor core to interface with the RAP and the adaptation controller block. This functionality may be directly part of RAP or adaptation controller or the TCP/IP processor core. 
The RAP interface would inform the adaptation controller in this instance of the arrival of XML traffic, so the adaptation controller can fetch the appropriate configuration from configuration memory, block 6204, which may be internal or external memory or a combination thereof. The adaptation controller can then provide the configuration to RAP block 6206 and enable it when the XML data is ready to be operated on and is ready in the data buffers or external memory for the RAP to fetch it. Similarly, depending on the policies that may be programmed in the TCP/IP processor core and the received network traffic, the RAP block may get configured into a new hardware avatar to support the specific application, service or function or a combination thereof dynamically based on the characteristics of the received traffic and the policies.
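
The interaction between tagged traffic, the adaptation controller and the RAP block described above may be modeled in software as follows. This is a hedged sketch under stated assumptions, not the hardware design: the classes RuntimeAdaptableProcessor and AdaptationController, the handle method and the use of a dictionary as configuration memory are hypothetical, and an avatar is represented simply as a Python callable.

```python
class RuntimeAdaptableProcessor:
    """Software model of the RAP block 6206; an avatar is modeled as a callable."""

    def __init__(self):
        self.avatar_name = None
        self._function = None
        self.reconfigurations = 0     # counts how often the RAP was adapted

    def configure(self, name, function):
        # Load a new avatar into the RAP.
        self.avatar_name = name
        self._function = function
        self.reconfigurations += 1

    def process(self, payload):
        return self._function(payload)


class AdaptationController:
    """Software model of the adaptation controller 6203.

    config_memory stands in for block 6204 and maps a traffic tag
    (as set by the TCP/IP processor core) to a configuration.
    """

    def __init__(self, rap, config_memory):
        self.rap = rap
        self.config_memory = config_memory

    def handle(self, tag, payload):
        if tag not in self.config_memory:
            # No configuration realized for this traffic: pass it to the host.
            return ("host", payload)
        if self.rap.avatar_name != tag:
            # Fetch the configuration and adapt the RAP to the new avatar.
            self.rap.configure(tag, self.config_memory[tag])
        return ("rap", self.rap.process(payload))
```

In this model, a controller built with a configuration memory containing only an "xml" entry processes XML-tagged payloads in the RAP, reconfigures only when the required avatar differs from the loaded one, and returns all other traffic for host processing.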

The TCP/IP processor core may also choose to pass the received data to the host processor without passing it for further processing through the RAP depending on the policies and/or the nature of the data received. Thus, if hardware configurations for specific operations or functions or policies or applications have not been realized, because the operations or policies or functions or applications may not be used often and hence do not cause performance issues or resources have not been assigned to develop the acceleration support for cost reasons or any other business or other reasons, those operations may be performed on the host. As those operations are realized as a runtime adaptable configuration, it may be provided to the adaptation controller so it can configure the RAP block for that operation as needed dynamically. The TCP/IP processor would also be informed to then identify such operations and pass them through RAP block. Using such a technique, over a period of time more applications can be accelerated without the need for changing or adding any hardware accelerators. The deployment of new policies or applications or services or operations on the runtime adaptable processor may be under the user or administrator control using very similar mechanisms as those shown for the security policy deployment and management. Thus, a central administrator can efficiently deploy new configurations to systems with the runtime adaptable protocol processor of this patent as and when needed. Similarly, the user or the administrator could remove or change the RAP supported functions or policies or applications as the need or the usage of the system changes. For example, a system using runtime adaptable protocol processor of this patent may initially be used for XML traffic, however its usage may change to support voice over IP application and XML acceleration may not be required, but instead some other voice over IP acceleration is needed. 
In such an instance the user or the administrator may be able to change, add or remove selectable hardware supported configurations from the specific system or systems. The central manager/policy server flow, central manager flow, rule distribution flow, control plane processor/policy driver flows and the like illustrated in FIG. 56 through FIG. 59 are applicable to the management, deployment, change, monitoring and the like for the runtime adaptable configurations as well with appropriate changes similar to those explained as follows. The runtime adaptable configuration creation flow may be added to the security policy creation flow for example. New configurations may become available from another vendor and the user may just need to select the configuration of interest to be deployed. The configuration distribution flow may be similar to the rule distribution flow, where the policies for the support of the configuration(s) may be distributed to the TCP/IP processor core blocks, whereas the configuration may be distributed to the adaptation controller of the system of interest or to a driver or a configuration control process on the host system or a combination thereof. Thus the runtime adaptable protocol processor systems may be integrated well into other enterprise management systems when used in that environment. The application or policy or service or operation configurations may be distributed by other means, for example as a software update over the network or through mass storage devices or other means. The foregoing description is one way of providing the updates in one usage environment but there can be multiple other ways to do the same for each embodiment and the usage environment as one skilled in the art can appreciate and hence should not be viewed as limited to the description above.

The adaptation controller may also be required to configure the RAP block to operate on data being sent out to the network. In such a case the RAP block may be required to operate on the data before it is passed on to TCP/IP processor core to send it to the intended recipient over the network. For example, it may be necessary to perform secure socket layer (SSL) operations on the data before being encapsulated in the transport and network headers by the TCP/IP processor core. The host driver or the application that is sending this data would inform the adaptation controller of the operation to be performed on the data before being passed on to the TCP/IP processor core. This can happen through the direct path from the host/fabric interface 6207 to the adaptation controller 6203. The adaptation controller can then configure RAP block 6206 or a part of it to perform the operation requested dynamically and let RAP operate on the data. Once RAP operation is completed it can inform the adaptation controller of the operation completion, which can then work with the TCP/IP processor core to send this data en route to its destination after appropriate protocol processing, header encapsulation and the like by the TCP/IP protocol processor. RAP 6206 may pass the processed data to the TCP/IP processor core through data buffers of block 6209 or by queuing them in memory using the memory interface block 6205. Thus the runtime adaptable TCP/IP processor of this patent can be configured to operate on incoming as well as outgoing data, before or after processing by the TCP/IP processor core.

Runtime adaptable processor 6206 may be restricted in size or capabilities by physical, cost, performance, power or other constraints. RAP extension interface, block 6211, may also be provided on the adaptable TCP/IP processor to interface RAP block 6206 to one or more external components providing runtime adaptable processor functionality. Thus the solution can be scaled to bigger size or features or capabilities using the RAP extension interface 6211. RAP extension interface comprises all the necessary control, routing, data, memory interface buses and connections as needed to seamlessly extend the RAP into one or more external components.

FIG. 63 illustrates an adaptable TCP/IP processor alternative of this patent to that described above. As indicated earlier, the TCP/IP processor portion of this processor may not only be the architecture disclosed in this patent but may also be a fixed function implementation, or may be implemented as a hardware state machine or may support partial protocol offloading capability, for example support fast path processing in hardware where control plane processing as well as session management and control may reside in a separate control plane processor or host processor or a combination of various architecture alternatives described above. The TCP/IP processor core architecture chosen may also include functions for security or RDMA or a combination thereof. Further, the adaptable TCP/IP processor architecture can be used for protocols other than TCP/IP like SCTP, UDP or other transport layer protocols by substituting the TCP/IP processor core with a protocol appropriate processor core. The adaptable TCP/IP processor alternate of FIG. 63 illustrates the runtime adaptable processor, block 6311, the adaptation controller, block 6310, and the configuration memory, block 6312, as integrated more tightly in the TCP/IP processor architecture to create a runtime adaptable TCP/IP processor. The functions provided by RAP, block 6311, adaptation controller, block 6310, and the configuration memory, block 6312, are very similar to those of the corresponding blocks in FIG. 62. The RAP interface functionality of block 6210, or the memory interface block 6205, or data buffers, block 6209, may be appropriately provided by blocks 6310, 6311 or 6312 or a combination thereof. It may also be distributed within the TCP/IP processor elements. This architecture may also provide a RAP extension interface like that of block 6211 to provide RAP scalability, even though such a block is not shown in FIG. 63. This version of the adaptable TCP/IP processor would also operate similar to that in FIG. 62 and can also be configured to operate on incoming as well as outgoing data, before or after processing by the TCP/IP processor core blocks.

FIG. 64 illustrates a runtime adaptable processor of this patent. The runtime adaptable processor comprises computational logic and interconnect resource that can be dynamically changed to map various hardware functions that need to be accelerated. The computational logic blocks may be realized using FPGA like combinational blocks for fine grain control or may consist of one or more simple programmable processor(s), ALU, and memory that can be configured to provide specific hardware function(s) at a given time which may then be changed dynamically to support a new function. The dynamic change of the function can be done by a configuration controller as needed by the usage of RAP. For example the computational block(s) may be set up to provide addition operation for a selected time period on incoming data to the computational block, but then as a new avatar is created the operation provided may be selected to be subtraction for the duration of the new avatar. The selection of the new operation for the new avatar may be done by the appropriate configuration controller. Thus the function provided by a computational block can be dynamically changed to another function, while some other computational blocks may continue to provide their selected operation. The computational block function change may take a certain period of time, which may be as low as a clock period or multiple clock periods or other period. The dynamic adaptation of one or more computational blocks may be done simultaneously or otherwise as needed. The interconnect resources may also be realized similarly to that of reconfigurable routing resources of FPGAs. FIG. 64 illustrates a runtime adaptable processor architecture of this patent as a hierarchy of computational logic, called compute clusters, blocks 6401(1) through 6401(Z), interconnected using a routing network comprised of routing resources 6407(a) through 6407(n). These routing resources are interconnected using the inter cluster routing switch, 6403. 
The inter cluster routing switch may be configured dynamically to provide highly programmable interconnections between various compute clusters thereby creating changing avatars of the hardware. The compute clusters may be configured individually by the global configuration controller, block 6405, which works with the adaptation controller, block 6203 of FIG. 62, to dynamically adapt the RAP. The global configuration controller works with configuration memory controller, block 6406 and configuration memory block 6204 and the adaptation controller 6203, both of FIG. 62, to retrieve configuration information which is used to dynamically change the individual compute cluster configurations and the inter cluster routing switch for interconnect configurations. Input/Output interface and controller, block 6404, is used to interface the runtime adaptable processor with adaptation controller, block 6203, data buffers, block 6209, RAP extension interface, block 6211 or the host/fabric/network interface, block 6207, all of FIG. 62. The global memory and controller, block 6402, provides global memory to compute clusters and also provides a controller to interface with external memory interface block 6205. Computational logic inside compute clusters 6401(1)-6401(Z) may need memory besides that inside each cluster. The global memory block and controller can fulfill this need. The figure illustrates multiple interconnection elements that serve different roles. Interconnect channels 6407(a) through 6407(n), are the routing channels to connect each cluster to the inter cluster routing switch to enable a multi-way connection capability for each cluster to source or sink information from other clusters of the RAP. Interconnect channels 6408(a) through 6408(n) provide memory interconnect resources for the compute clusters for them to get access to the global memory. 
These memory channels may be shared among the clusters in a column or there may be multiple paths to the memory which may be used simultaneously by many compute clusters in a column to write or read data to or from the memory. Interconnect channels 6409(a) through 6409(m) are the configuration channels that allow the global configuration controller to send configuration information to the compute clusters and receive event information or other information from compute clusters that may be used to change the configuration of a given cluster or some of the clusters or all RAP cluster configurations. The interconnect channel architecture and implementation for the above may be accomplished using wide busses, high speed serial interconnects or other implementation choices. The specific choice or topology is not dictated or implied by the figure.
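
The addition-to-subtraction example given for FIG. 64 may be modeled as follows, purely as an illustrative sketch. The ComputeBlock and ConfigurationController classes are hypothetical software stand-ins, an avatar is reduced to a named two-input function, and no timing of the configuration change is modeled.

```python
import operator


class ComputeBlock:
    """Software model of one computational block with selectable avatars."""

    def __init__(self, avatars, selected):
        self.avatars = avatars        # avatar name -> two-input function
        self.selected = selected      # currently selected avatar name

    def select(self, name):
        if name not in self.avatars:
            raise KeyError(f"unknown avatar: {name}")
        self.selected = name

    def apply(self, a, b):
        # Operate on incoming data with the currently selected avatar.
        return self.avatars[self.selected](a, b)


class ConfigurationController:
    """Hypothetical controller that changes one block without touching the others."""

    def __init__(self, blocks):
        self.blocks = blocks          # block id -> ComputeBlock

    def reconfigure(self, block_id, avatar):
        self.blocks[block_id].select(avatar)
```

For instance, with two blocks both initially selected to "add" from a table of operator.add and operator.sub, calling reconfigure(0, "sub") changes block 0 to subtraction while block 1 continues to add, in the manner described for blocks that keep their selected operation.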

The runtime adaptable processor of FIG. 64 can be configured such that computation array may be split into partial regions, where each region may be configured to perform a specific hardware operation. For example clusters 6401(1), 6401(2) may form one region whereas clusters 6401(3) through 6401(M) may form another region and some other clusters may form yet another region. Some of the regions may be interconnected as pipelined stages as may be required by the hardware function being mapped onto the runtime adaptable processor. Regions of the mapping may interconnect with each other or may operate on independent data or streams as may be appropriate for the operations mapped. The regions can all be dynamically adapted with the changing needs of the processing requirements. The regions can be very granular or may involve only partial compute clusters as well. Hence the runtime adaptable processor of this patent is dynamically adaptable to a very fine grain level to meet the demands of the required processing.

FIG. 65 illustrates a compute cluster of this patent. The compute cluster comprises computational elements (CE), blocks 6501(1) through 6501(Z), that provide computational logic. CEs may be composed of FPGA-like combinational logic blocks and interconnect resources or may be simple programmable processors with ALU and memory which provide a given hardware function based on the instructions programmed. CEs may be dynamically configured by changing the instruction being executed on the input data or stored data or combination thereof to perform a new hardware function. The processors may be simple processors supporting hard wired instructions through combinational logic that can select required hardware operation configured in the combinational logic. The processors may be more complex processors where the hardware configuration may select the instruction that is executed through the ALU and other functional resources providing a virtual hardware avatar/incarnation/configuration. The avatar may also comprise multiple instructions being executed through the resources forming a more complex configuration. Multiple avatars may be programmed in the CE, and a specific avatar can be dynamically selected, providing very flexible hardware architecture. The CEs may provide bit-wise operations, as well as operations on groups of bits like 4-bit, 8-bit, 16-bit groupings as desired by the granularity of the configuration options. These bit groupings may be an implementation choice where larger or smaller groupings can be selected without deviating from the principles of the teachings of this patent. The cluster configuration controller, block 6507, interacts with the global configuration controller, block 6405 of FIG. 64, to select the specific avatar for each CE. 
The interconnect channels, 6502(1) through 6502(N), provide the configuration information from the configuration controller to the CEs and any execution information or events or avatar change requests or a combination thereof from the CEs to the cluster configuration controller. This information may be used to direct the flow of the configuration controller mapping different avatars in conjunction with the global configuration controller 6405 and/or the adaptation controller 6203 of FIG. 62. This information may be used to create pipeline stages of operation where different portions of compute cluster or compute clusters provide multiple stages of operations. Interconnect channels, 6503(1) through 6503(M) provide connectivity to cluster memory and controller, block 6506, to the CEs. There may be multiple parallel paths into the cluster memory, thereby allowing multiple simultaneous accesses to different regions of the memory as indicated. CEs on a given channel may all share the channel or there may be multiple paths per channel to the memory as well. The cluster memory may be a single memory array or may be multiple memory arrays as an implementation choice. The cluster memory is also coupled to the global memory and controller, block 6402 of FIG. 64, through channels like 6508 and 6408(a) through 6408(n) of FIG. 64. The global memory and cluster memory may each be accessible from the host/fabric/network interface, 6207, or the adaptation controller, 6203, or memory interface, 6205 or a combination thereof to read or write memory locations individually or as a set of locations for initialization or other purposes like DMA access, test, or the like. The CEs may also provide connectivity to their next neighbor as indicated in FIG. 65 by the arrows. Not all neighbor connections indicated have to be present. This can be an implementation choice. 
These connections allow CEs to send or receive output or input data, flags, exception conditions or like information or a combination thereof to their neighbors. The avatar selected for the CE would decide which inputs to use to retrieve the needed information to operate on. The outputs from CEs may also be selected to be coupled to the cluster routing switch, 6506, which can then provide selected connectivity between CEs within the cluster, as well as provide connectivity with other compute clusters by coupling with the inter cluster routing switch, 6403, and the interconnect channels like 6407(a) through 6407(n). The cluster routing switch may be configured for the appropriate interconnections through the cluster configuration controller, 6507, by coupling with interconnect channel, 6504.

FIG. 66 illustrates a security solution using the teachings of this patent. The security solution comprises a central manager, a network, one or more line cards and secure chips. The central manager is a collection of functional modules that reside in a central management system used by IT manager(s) to create, deploy and monitor security rules. Central manager modules are similar to those of the central manager shown in FIG. 56 and both are used interchangeably in the following description. These modules may reside on the same set of systems that are used for managing the overall network or may be on independent systems. Block 6601 is illustrated to represent network management applications and security applications that may be deployed for a network. These applications are used by the IT managers to create their security rules or policies. The central manager provides an application programmer interface (API), block 6602, which provides a uniform interface to security and management applications of 6601, to use the distributed security system of this patent. The API interfaces with network layer rules engine, block 6603, application layer rules engine, block 6604, storage area network rules engine, block 6605, or other application specific rule engines, block 6619, or a combination thereof. These rule engines provide API support functions for one or more specific categories of the rules that they represent. They may also provide preconfigured rule templates that an IT manager can use by filling in relevant fields of the rules for their specific needs. For instance there may be a set of rules that deny connection requests to all users whose network address is not a local address, or deny requests to specific ports like port 80 for all outside connections or the like or a combination thereof. These rules engines assemble the rules and provide them to the rules compiler, block 6606, for compiling them for distribution to secure nodes. 
The compiler uses the node capability and connectivity database, block 6617, to compile node appropriate rules and actions. The compiled rules are deposited in compiled rules database, block 6618. The rules distribution engine, block 6607, distributes the rules to the appropriate nodes using a central manager flow and rules distribution flow similar to that illustrated in FIG. 57 and FIG. 58. The security rules may be distributed to the host processor or a control plane processor as illustrated in FIG. 58 or to a control processor and scheduler, block 7103, described below, or a combination thereof as appropriate depending on the node capability. The rules may be distributed using a secure link or insecure link using proprietary or standard protocols as appropriate per the specific node's capability over a network. The network may be a local area network (LAN), wide area network (WAN), metro area network (MAN), wireless LAN, storage area network (SAN) or a system area network or another network type deployed or a combination thereof. The network may be Ethernet based, internet protocol based or SONET based or other protocol based or a combination thereof. Monitoring interface, block 6609, and Event recording engine and database, block 6608, are utilized to collect various security and/or management events from various nodes that are being monitored for security violations or other conditions as defined by the rules. These blocks represent the central manager blocks 5613, 5614, 5615 and 5616 described above and provide similar functionality. The monitoring engine may provide the analysis capability as described for block 5613 or may work with analysis and reporting application(s) illustrated by block 6610 to provide intelligent reports to the IT manager of security violations or breaches or conformance or other issues upon request or automatically as programmed by the IT manager depending on the nature of the issue and its severity. 
The central manager modules may also be deployed local to a network node system, for example a switch or a router, and work within the system's control and management software. It may be used to deploy and monitor rules local to various line card(s) or accelerator cards or other cards providing security capability of the system. In such an instance, the network used to communicate the rules may be a local bus or a system area network, or a combination thereof, of the specific system.

The security solution comprises line cards which may incorporate the security processor, SAN protocol processor, TCP/IP processor or runtime adaptable protocol processor or various other processors disclosed in this patent. The line card configuration and the architecture may vary with the specific system and the application. Three types of line card architectures, a) flow-through, b) look-aside and c) accelerator card, are described in this patent to illustrate usage models for the processors of this patent. FIG. 68, FIG. 69 and FIG. 70 illustrate these configurations using a security processor based system, though it could also be based on other processors of this patent. Blocks 6612 and 6613 illustrate two of these types of card configurations. The security processor illustrated in these cards is that disclosed in this patent. Many variations of the security processor can be created depending on the functionality incorporated in the processor. Blocks 6614 and 6615 illustrate two versions of such a security processor. Block 6614 illustrates a security processor core comprising at least a content search and rule processing engine coupled with a runtime adaptable processor. This processor is similar to that illustrated in FIG. 71 and is described in detail below. Block 6615 illustrates the security processor of block 6614 coupled with a TCP/IP processor or a protocol processor to provide more functionality usable in a security node as a security processor. A reduced functionality security processor, not illustrated, may also be created by removing the runtime adaptable processor and associated logic from block 6614 to provide a content search and rules processing engine based security processor. 
The choice of the security processor may depend on the system in which it is being deployed, the functionality supported by the system, the solution cost, performance requirement, or other reasons, or a combination thereof. The security processor may use one or more ports to connect to external memories, block 6616, which may be used to store rules information, or other intermediate data or packets or other information as necessary to perform various functions needed for security processing. The memories may be of various types like DRAM, SDRAM, DDR DRAM, SRAM, RDRAM, FCRAM, QDR SRAM, DDR SRAM, Magnetic memories, Flash or a combination thereof or future derivatives of such memory technologies. The inventions disclosed in this patent enable many variations of the architectures illustrated, and it may be appreciated by those skilled in the art that changes in the embodiments may be made without departing from the principles and spirit of the invention.

FIG. 67 illustrates the security solution compiler flow. As described above, security rules may be of various types like application layer rules, block 6701, network layer rules, block 6702, storage area network rules, block 6703, or application specific rules, block 6619, or a combination thereof. As illustrated in FIG. 67, application layer rules comprise basic string search rules that may be expressed in a special language or a standard representation like regular expressions or a combination thereof. Application layer rules, which typically require searching character strings deep inside a packet, may be represented using a regular expression. The types of application layer rules or network layer rules or SAN rules or application specific rules may vary with the specific node where they may be deployed, the organization or the entity using them, the security threats being defended against or other purposes or a combination thereof. The figure illustrates various categories of rules that may be created depending on the usage model. These rules may be created by anti-spam software vendors or the entity using the security system or the vendor supplying the security solution or other third parties or a combination thereof. There may be application layer rules to defend against spam. These may comprise rules that have been created using the knowledge of spam or unwanted messages. For instance a rule may be to search for a message like “receive million USD” inside any incoming email anywhere within the email including the header and the message. Such a rule may be represented using a regular expression like “.*receive million [Uu][Ss][Dd]” which will detect the message of interest, i.e. “receive million USD”, but may also detect variations of this where USD is not all capitals, e.g. usd or UsD or usD or the like. The leading “.*” in this rule indicates to search for the message anywhere within the received data packets. A set of rules may be defined like the one above to form the anti-spam rule set. 
These rules may be updated as new types of spam or methods are discovered and can be kept up to date with constantly evolving threats. Similarly a set of rules may be developed to perform virus scan functions to detect various known viruses. Anti-virus rules are typically signature matching or pattern matching rules similar to those discussed above. The virus signatures may be looked for at specific locations in a message or a file and may be described using a similar method. A set of anti-virus rules is defined from known virus signatures to detect known viruses. As new viruses or worms become known, the anti-virus rules may be updated to defend against them as well. These rules would then be compiled through the security compiler flow and distributed to all nodes of interest as discussed earlier. Once these rules get deployed, the security nodes may be programmed to take action corresponding to the match on an anti-spam or anti-virus rule to deny access to the particular node originating the message, or drop the connection, or flag the session to the IT manager, or take other appropriate action as defined by the rule. The central manager modules provide the ability for the IT manager to define such actions when certain conditions like those above are met. The actions may comprise one or more of drop connection, deny access, report violation, page the network manager, allow access but record the violation for later analysis, or isolate the source node to a specific virtual LAN, or transfer the connection to some other node, or other similar action as appropriate.
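As a minimal software sketch of the anti-spam rule discussed above, the following Python fragment expresses the phrase with each letter of USD written as a two-member character class. The names `SPAM_RULE` and `matches_spam_rule` are illustrative assumptions, and Python's `re` module merely stands in for the hardware search engines of this patent.

```python
import re

# Hypothetical anti-spam rule: the phrase with any capitalization of "USD".
# The leading ".*" of the rule is implied by re.search, which scans for the
# pattern anywhere in the input.
SPAM_RULE = re.compile(r"receive million [Uu][Ss][Dd]")


def matches_spam_rule(message: str) -> bool:
    """Return True if the rule matches anywhere in the message."""
    return SPAM_RULE.search(message) is not None


if __name__ == "__main__":
    for text in ("You will receive million USD today",
                 "you will receive million usD",
                 "receive million euros"):
        print(text, "->", matches_spam_rule(text))
```

A rule set would be a collection of such expressions, each paired with an action such as dropping the connection or flagging the session.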

Similarly there may be many other categories of application layer rules. For example there may be rules defined to manage digital rights of the owners of the electronic documents or media or content as may be appropriate. Such rules may also be defined similar to the signature or pattern matching or string of character matching rules above. These rules may flag matches to a specific digital rights signature inside the content, which can then be used to refer to a digital rights database that may indicate if such an access or usage of the digital content is permitted to the owner. The digital rights ownership database may reside in the memories associated with the security processor, and a control processor, like block 6809 or block 7103, described below, can refer to that database to decide if valid ownership exists or not and, if it does not exist, what specific action should be taken based on the defined rule. The digital rights confirmation may be done by some other device or processor in the specific node which is performing the digital rights signature matching. The decision of where to perform such analysis functionality may depend on the specific system usage model and the system design choices. A set of rules for digital rights management may also be created as part of the application layer rules for the security processor.

Instant messaging (IM) has gained tremendous success in its usage by individuals as well as corporations. Instant messaging may be regulated for various industries, like the financial industry, to be preserved for future reference, and is also subject to spam like other modes of communication such as email. Thus some organizations may create rules specifically targeted towards instant messaging to protect against ensuing liabilities in case of wrongful usage or protect the users from unwanted spam or for other reasons as deemed appropriate by the organization. One of the issues with instant messaging is that any level of policing has to be done in stream without creating delays in the communication. Thus a hardware based security enforcement of this patent may be needed to monitor IM. These rules are similar to other application layer rules discussed above and may be created using similar means like defining the message search strings using regular expressions.

Recent surveys by the FBI and others have found that over 70% of attacks on information technology are from within an organization. Thus there is a need for a class of security devices and rules to protect from the damaging effects of such attacks. These rules are defined as extrusion detection rules. The extrusion detection rules may be created to detect intentional or unintentional disclosure of confidential or proprietary or sensitive information of the organization and keep it from going outside the perimeter of the organization using the network. For example a software company may need to guard its core software source code from accidental or malicious disclosure to people or entities unauthorized to get it. A set of rules may thus be created by the organization that may search for specific strings or paragraphs or code modules or other appropriate information within all outbound messages and flag them or prevent them from being sent. Such rules may also be compiled using the security compiler flow and distributed to the appropriate node or nodes. For example a rule may be defined to search for a “Top Secret” phrase in any message being sent that is outbound from the organization and flag such a message for further review by the IT manager or to drop such connection and inform the user or other responsible person. A regular expression rule “.*Top Secret” may be defined to search for the term anywhere in a message. Such rules may also be created as application layer rules that may then be compiled and distributed to appropriate nodes for detection and enforcement of extrusion detection security functionality.

The IT manager may be able to create classes of rules from the application layer rules or network layer rules or SAN rules or application specific rules or other rules and deploy a class of rules to a class of security nodes and a different class of rules to another set of security nodes. For example the manager can create certain application layer rules like anti-spam or anti-virus rules and network layer rules that are deployed to the switches and routers of the network that are security enabled with the teaching of this patent and another set of rules like extrusion detection rules and network layer rules for sensitive servers holding critical top secret information. It may be possible to create different sets of rules that may be deployed depending on the functions within an organization. For example, security nodes that are deployed within a manufacturing department may get one set of rules while those in an engineering department may get a different set of rules. Creating a different set of rules for different types of devices or different device users or node specific rules or a combination thereof can be used as a process to create a pervasive and layered security within an organization.
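The mapping of rule classes to classes of security nodes can be pictured with a short Python sketch. Everything named here is a hypothetical illustration: the `Node` record, the class labels, the `max_rules` capacity field and the rule strings (including the made-up syntax of the network rule) are assumptions, not formats defined by this patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    node_class: str   # hypothetical labels such as "switch" or "sensitive-server"
    max_rules: int    # hypothetical capacity of the node's rule memory


# Hypothetical rule classes; each rule is shown as a plain string.
RULE_CLASSES = {
    "anti-spam": [".*receive million [Uu][Ss][Dd]"],
    "extrusion": [".*Top Secret"],
    "network": ["deny tcp any any port 80"],   # invented syntax, illustration only
}

# Which rule classes each class of node receives.
DEPLOYMENT = {
    "switch": ["anti-spam", "network"],
    "sensitive-server": ["extrusion", "network"],
}


def rules_for(node: Node) -> list:
    """Collect the rules for a node, rejecting sets the node cannot hold."""
    selected = [rule
                for rule_class in DEPLOYMENT.get(node.node_class, [])
                for rule in RULE_CLASSES[rule_class]]
    if len(selected) > node.max_rules:
        raise ValueError(
            f"{node.name}: {len(selected)} rules exceed capacity {node.max_rules}")
    return selected
```

For example, `rules_for(Node("sw1", "switch", 10))` would return the anti-spam and network rules, while a sensitive server would receive the extrusion detection and network rules instead.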

Similarly there may be application layer rules that detect or flag access to specific web addresses or URLs or other confidential information like customer information comprising their credit card numbers, or health information or financial reports or the like, which may be used to create a different set of application rules as shown in block 6701. With an increase in usage of voice over IP solutions within organizations and over the internet, security threats are also increasing. It may then be necessary to create rules specific to VOIP; for example, rogue connections may need to be detected and flagged, or VOIP traffic may not be allowed to go outside an organization's boundary, or viruses entering the organization through VOIP connections may need to be detected, or VOIP traffic may need to be kept confidential by encrypting it, or the like. The VOIP rules may also be created using the same application layer rules engines and detect matches to the rules at appropriate nodes in the network. The runtime adaptable processor, block 7102, described below, may be used to provide encryption or decryption services to VOIP traffic when such traffic is detected by the VOIP rule match. Similarly, other application specific rules may also be developed and provided in the central manager modules to be programmed, compiled and distributed to the secure nodes in the network using the compiler flow illustrated in FIG. 67.

Network layer rules, block 6702, may comprise various rules targeted at the network and transport layers of the network. These rules are similar to those illustrated in FIG. 60. These rules may include IP level address rules, protocol port rules, protocol specific rules, connection direction oriented rules, and the like. These rules may be described in a special language or using regular expressions. In TCP/IP based networks these are primarily TCP and IP header fields based rules, where matches may be defined on source address or destination address or an address range or port numbers or protocol type or a combination thereof. Similarly there may be rules targeted specifically to storage area networks which may transport critical information assets of an organization. This is shown as a different category of rules, but may comprise storage network's network layer rules, application layer rules or the like. There may be rules targeted to specific logical unit numbers (LUNs) or zones (groups of source/destination addresses) or logical or physical block addresses or the like. These rules may also be represented in a specific language or as strings of characters or data patterns using regular expressions.
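A header-field rule of the kind described above can be sketched in Python as follows. The `NetworkRule` record, its field names and the first-match evaluation order are assumptions made for illustration; the patent does not define a rule format at this level of detail.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional


@dataclass(frozen=True)
class NetworkRule:
    """Hypothetical header-field rule; a field left as None matches anything."""
    action: str
    src_net: Optional[str] = None
    dst_net: Optional[str] = None
    dst_port: Optional[int] = None
    protocol: Optional[str] = None

    def matches(self, src: str, dst: str, dst_port: int, protocol: str) -> bool:
        if self.src_net and ip_address(src) not in ip_network(self.src_net):
            return False
        if self.dst_net and ip_address(dst) not in ip_network(self.dst_net):
            return False
        if self.dst_port is not None and dst_port != self.dst_port:
            return False
        if self.protocol and protocol != self.protocol:
            return False
        return True


def first_action(rules, src, dst, dst_port, protocol, default="allow"):
    """Return the action of the first matching rule (assumed ordering policy)."""
    for rule in rules:
        if rule.matches(src, dst, dst_port, protocol):
            return rule.action
    return default


# Allow local sources, then deny port 80 requests from everyone else.
RULES = [
    NetworkRule("allow", src_net="10.0.0.0/8"),
    NetworkRule("deny", dst_port=80, protocol="tcp"),
]
```

With these two rules, a port 80 request from a local address is allowed, while the same request from an outside address is denied.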

The secure solution compiler of FIG. 67 allows an IT manager to create security rules of different types as discussed above and enables the manager to create a layered and/or pervasive security model. The compiler flow would be provided with the characteristics of the specific nodes like the security capability presence, the rules communication method, the size of the rule base supported, the performance metrics of the node, deployment location e.g. LAN or SAN or other, or the like. The compiler flow then uses this knowledge to compile node specific rules from the rule set(s) created by the IT manager. The compiler comprises a rules parser, block 6704, for parsing the rules to be presented to the lexical analyzer generator, block 6705, which analyzes the rules and creates the rules database used for analyzing the content. The rule parser may read the rules from files of rules or directly from the command line or a combination depending on the output of the rule engines. The rules for a specific node are parsed to recognize the language specific tokens used to describe the rules or regular expression tokens. The parser then presents the tokens to the lexical analyzer generator. The lexical analyzer generator processes the incoming tokens and generates a non-deterministic finite automaton (NFA) which represents rules for parsing the content. The NFA is then converted into a deterministic finite automaton (DFA) by the lexical analyzer generator to enable deterministic processing of the rule states. The process of creating NFAs and DFAs is well understood by modern compiler developers. However, the lexical analyzer generator creates various tables that represent DFA states and the state transition tables for the rules that are used by a hardware lexical analyzer instead of generating lexical analysis software as is done for compilers. One way to view the rules is that they define a language to recognize the content. 
These tables are used by a lexical analyzer hardware or content search and rule processing engine, block 7106, described below, to analyze the stream of data being presented to the security processor of this patent. The regular expression rules can be viewed as defining a state transition table. For example, if a string “help” is being searched, using a regular expression “help”, then each character of the regular expression can be viewed to represent a state. There may be a start state s0, and character specific states s1(h), s2(e), s3(l), and s4(p) where s(x) represents a state for a character x. There may also be error states like s_err which may be entered upon terminating a search when appropriate transition conditions are not met. As the input stream is being analyzed by the hardware lexical analyzer this state machine is activated when a first “h” is encountered, and the state machine reaches s1. Now if the next character in the stream is an “e” then the state machine transitions to s2. Thus if a string “help” is encountered the state machine will reach state s4. States s1 through s3 are accepting states, meaning they continue the search to the next state. State s4, for this string, is marked by the lexical analyzer generator as a terminal state. These states are marked as accepting or terminal states in the accept tables. When a comparison reaches a terminal state, a match with the specific rule may be indicated. Any action that needs to be taken based on matching of a rule is created in a match/action table as an action tag or instruction that is then used by the content search and rule processing engine, block 7106, to take specific action or forward the match and action information to control processor, block 7103, to take appropriate rule specific action. However, if there is only a partial rule match e.g. 
if the input content includes the string “her”, then the rule processing hardware will enter state s2, having encountered “he”; however, as soon as “r” is analyzed, an error is indicated to mean that there is no rule match, and processing of the input stream starts from that point forward from the initial state s0. Though the above description is given with regards to using a single character match per state, it is also possible to analyze multiple characters at the same time to speed up the hardware analysis. For example, the lexical analyzer generator may create tables that enable transition of 4 characters per state, thereby quadrupling the content search speed. The lexical analyzer generator creates character class tables, block 6706, next state look-up tables, block 6709, state transition tables, block 6707, accept states, block 6708, and match/action tables, block 6710, which are then stored in the compiled rules database storage, block 6711. The character class tables are created by compressing the characters that create a similar set of state transitions into a group for compact representation. The state transition tables comprise rows of states in a DFA table with compressed character classes as the columns to look up the next state transitions. The next state tables are used to index to the next state from the current state in the state machine represented by the DFA. These tables are stored in on-chip and off-chip memories associated with security processors of this patent. The compiler of this patent uses the node characteristics and connectivity database to create the rules on a node by node basis. The compiler indicates an error to the IT manager if certain rules or rule sizes do not match the capabilities of the specific nodes so they may be corrected by the manager. This information is retrieved from a node characteristics and connectivity database as illustrated by block 6713.
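The “help” example above can be mirrored in a small table-driven Python sketch with a character class table, a state transition table and a terminal state. This is an illustration under stated assumptions rather than the table layout of this patent: all names are hypothetical, a single pattern is handled, and a mismatching “h” is allowed to re-enter s1 directly, a shortcut that is only valid because the first character of the pattern does not recur inside it.

```python
PATTERN = "help"

# Character class table: every character outside the pattern collapses into
# class 0; each distinct pattern character gets its own class.
CLASS_OF = {ch: i + 1 for i, ch in enumerate(sorted(set(PATTERN)))}
NUM_CLASSES = len(CLASS_OF) + 1

START = 0                # state s0
TERMINAL = len(PATTERN)  # state s4 for "help"


def char_class(ch: str) -> int:
    """Look up the compressed class of a character."""
    return CLASS_OF.get(ch, 0)


def build_transition_table():
    """Rows are states s0..s3, columns are character classes."""
    first = char_class(PATTERN[0])
    table = []
    for state in range(TERMINAL):
        row = [START] * NUM_CLASSES                  # mismatch: back to s0
        row[first] = 1                               # an "h" (re)starts at s1
        row[char_class(PATTERN[state])] = state + 1  # expected character
        table.append(row)
    return table


def search(data: str):
    """Return the end offsets (exclusive) of every match of PATTERN in data."""
    table = build_transition_table()
    state, hits = START, []
    for pos, ch in enumerate(data):
        state = table[state][char_class(ch)]
        if state == TERMINAL:      # terminal state reached: rule match
            hits.append(pos + 1)
            state = START
    return hits
```

Feeding the string “her help” through `search` walks s0, s1, s2, falls back to s0 on the “r”, and later reaches s4 at the end of “help”. The table has four rows and five columns regardless of the size of the input alphabet, which is the compaction that character classes provide.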

Rules distribution engine, block 6712, follows the central manager and rules distribution flow illustrated in FIG. 57 and FIG. 58. The security rules may be distributed to the host processor or a control plane processor as illustrated in FIG. 58 or to a control processor and scheduler, block 7103, described below, or a combination thereof as appropriate depending on the node capability. The rules may be distributed using a secure link or insecure link using proprietary or standard protocols as appropriate per the specific node's capability over a network.

FIG. 71 illustrates a security processor of this patent. The security processor comprises a coprocessor or host bus interface, block 7101, a control processor and scheduler, block 7103, at least one content search and rules processing engine, block 7106, next state memory, block 7110, match/action table memory, block 7111, character class table memory, block 7107, and accept and state transition memories, block 7108. The security processor may also comprise packet buffers, block 7104, memory controller, block 7112, run time adaptable processor, block 7102, adaptation controller, block 7105, and configuration memory, block 7109. A version of the security processor may be created by using a coprocessor or host interface controller acting as a data interface, a control processor and scheduler, at least one content search and rules processing engine, next state memory, match/action table memory, character class table memory, accept and state transition memories and memory controller. The memory controller may not be required in system applications where the number of rules is small enough to fit in the on-chip memories. Such a processor may perform all the content search tasks; however it may not be able to provide targeted application acceleration, which may be feasible with a security processor that includes a run time adaptable processor.

The control processor and scheduler, block 7103, communicates with the rules distribution engine, block 6712, to receive appropriate data tables prior to starting the content inspection. It stores the received state information in the respective dedicated memories. The character class table from block 6706 is stored in the memory block 7107. The state transition and accept tables, blocks 6707 and 6708, are stored in their respective memories represented by block 7108. Block 7108 may also be two or more separate memories for performance reasons but is illustrated by one block in the figures. The next state look-up tables from block 6709 are stored in the next state memory, block 7110. The match/action tables from block 6710 are stored in their memory block 7111. These tables may be larger than the memory available in the security processor on-chip, and may be stored in external memory or memories that are accessed by the memory controller block 7112. There may be multiple ports to memory to speed up access to data tables stored in external memories. These memories may be of various types like DRAM, SDRAM, DDR DRAM, SRAM, RDRAM, FCRAM, QDR SRAM, DDR SRAM, Magnetic memories, Flash or a combination thereof or future derivatives of such memory technologies. For most applications the next state table and action tables may need to be off-chip, whereas the other tables may be maintained on chip dependent on the size and number of the rules. Once the rules distribution engine provides the tables to the control processor and scheduler, block 7103, and they are set up in their respective memories, the security processor is ready to start processing the data stream to perform content inspection and identify potential security rule matches or violations. The security processor state configuration information is received via a coprocessor/host interface controller. The security processor of this patent may be deployed in various configurations like a look-aside configuration illustrated in FIG. 
69 or flow-through configuration illustrated in FIG. 68 or an accelerator adapter configuration illustrated in FIG. 70 as well as others not illustrated which can be appreciated by persons skilled in the art. In a look-aside or an accelerator adapter configuration, the security processor of this patent is under control of a master processor which may be a network processor or a switch processor or a TCP/IP processor or classification processor or forwarding processor or a host processor or the like depending on the system in which such a card would reside. The control processor and scheduler receives the configuration information under the control of such master processor that communicates with the rule engine to receive packets that contain the configuration information and passes it on to the security processor. Once the configuration is done, the master processor provides packets to the security processor for which content inspection needs to be performed using the coprocessor or host interface. The coprocessor or the host interface may be standard buses like PCI, PCI-X, PCI express, RapidIO, HyperTransport or LA-1 or SRAM memory interface or the like or a proprietary bus. The bandwidth on the bus should be sufficient to keep the content search engine operating at its peak line rate. The security processor may be a memory mapped or an IO mapped device in the master processor space for it to receive the packets and other configuration information in a look-aside or accelerator configuration. The security processor may be polled by the master processor or may provide a doorbell or interrupt mechanism to the master to indicate when it is done with a given packet or when it finds a match to the programmed rules. 
The control processor and scheduler, block 7103, and the coprocessor or host bus interface, block 7101, work with the master processor to provide the above functionality. The control processor and scheduler stores incoming packets to the packet buffer, block 7104, and schedules the packets for processing by the content search and rule processing engines as they become available to analyze the content. The scheduler maintains the record of the packets being processed by the specific engines and once the packets are processed it informs the master processor. The content search and rule processing engines of block 7106 inform the control processor and the scheduler when they have found a match to a rule and the action associated with that rule as programmed in the match/action table. This information may in turn be sent by the control processor to the master processor, where the master processor can take specific action for the packet indicated by the rule. The actions may be one from a multitude of actions like dropping the packet or dropping a connection or informing the IT manager, or the like, as discussed earlier. When the security processor includes a runtime adaptable processor like block 7102, the control processor and scheduler may schedule operations on the packet through block 7102. The control processor would work with the adaptation controller, block 7105, to select the specific avatar of the processor for the needed operation. For example, a packet that needs to be decrypted before being analyzed may be scheduled to the adaptable processor before being analyzed by the content search engines. Once the packet has been decrypted by the adaptable processor it is then scheduled by block 7103 to block 7106. However, the runtime adaptable processor may also operate on a packet once a match has been found by the content search engines or the packet has been processed by the search engine without any issues. For example, the packet data may need to be encrypted once no issues have been found. 
The control processor and scheduler schedules the packets to the runtime adaptable processor in the appropriate order as defined by the needs of the operation. The runtime adaptable processor, block 7102, adaptation controller, block 7105, and configuration memory, block 7109, are similar to those illustrated in FIGS. 62, 63, 64 and 65. The runtime adaptable processor and the associated blocks provide similar functionality with appropriate logic enhancements made to couple to the control processor and scheduler of the security processor. The runtime adaptable processor may be used to provide compression and decompression service to the packets if the appropriate adaptation configurations are deployed. The runtime adaptable processor may also be used for VOIP packets providing relevant hardware acceleration service to those packets like DSP processing or encryption or decryption or the like.

The security processor may also need to provide inspection ability across multiple packets in a connection between a source and a destination. The control processor and scheduler, block 7103, provides such functionality as well. The control processor may store the internal processing state of the content search and security processing engine in a connection database which may be maintained in the on-chip memory in the control processor or in the off-chip memory. The control processor and scheduler looks up the execution or analysis state for a given connection when a packet corresponding to the connection is presented to it by the master processor or in the incoming traffic in a flow-through configuration described below. The connection ID may be created by the master processor and provided to the security processor with the packet to be inspected, or the security device may derive the connection association from the header of the packet. The connection ID may be created in the IP protocol case by using a 5-tuple hash derived from the source address, destination address, source port, destination port and the protocol type. Once the connection ID is created, and resolved in case of a hash conflict, by the control processor and scheduler, it then retrieves the state associated with that connection and provides the state to the search engines, block 7106, to start searching from that state. This mechanism is used to create multi-packet searches per connection and detect any security violations or threats that span packet boundaries. For example, if there is a rule defined to search for “Million US Dollars” and if this string appears in a connection data transfer in two separate packets where “Million U” appears in one packet and “S Dollars” appears in another packet, then if the connection based multi-packet search mechanism of this patent is not present the security violation may not be detected since each packet individually does not match the rule. 
However, when the multi-packet search is performed, no matter how far apart in time these two packets arrive at the security node, the state of the search will be maintained from one packet to another for the connection and the strings of the two packets will be detected and flagged as a continuous string “Million US Dollars”.
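
The connection based multi-packet search described above can be sketched in software as follows. This is only an illustrative model under stated assumptions: a Python dictionary keyed by the 5-tuple stands in for the hashed connection database (the dictionary resolves hash conflicts internally), and a simple Knuth-Morris-Pratt matcher stands in for the rule state machine of the search engines. The names ConnectionSearch, state_db and inspect are hypothetical and do not come from the disclosure.

```python
from typing import Dict, List, Tuple

# Assumed layout: (source address, destination address, source port,
# destination port, protocol type).
FiveTuple = Tuple[str, str, int, int, int]


def build_failure_table(pattern: str) -> List[int]:
    """KMP prefix table; a stand-in for the programmed rule state tables."""
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


class ConnectionSearch:
    """Keeps one saved search state per connection (hypothetical model)."""

    def __init__(self, pattern: str) -> None:
        self.pattern = pattern
        self.fail = build_failure_table(pattern)
        # Connection database: 5-tuple -> number of pattern characters matched.
        self.state_db: Dict[FiveTuple, int] = {}

    def inspect(self, conn: FiveTuple, payload: str) -> bool:
        """Scan one packet, resuming from the state saved for its connection."""
        state = self.state_db.get(conn, 0)  # start state for a new connection
        matched = False
        for ch in payload:
            while state and ch != self.pattern[state]:
                state = self.fail[state - 1]
            if ch == self.pattern[state]:
                state += 1
            if state == len(self.pattern):
                matched = True
                state = self.fail[state - 1]
        self.state_db[conn] = state  # save the state for the next packet
        return matched
```

With the rule “Million US Dollars”, a first packet ending in “Million U” returns no match but leaves nine characters of progress stored for that connection, so a later packet beginning with “S Dollars” on the same connection completes the match, while the same payload on a different connection does not.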

As discussed earlier, the security processor of this patent may also be deployed in a flow-through configuration. For such a configuration the security processor may include two sets of media interface controller ports as illustrated by blocks 7201 and 7213. The security processor illustrated in FIG. 72 is very similar to that in FIG. 71; however, it has multiple media interface controller ports in place of the host or coprocessor interface block like block 7101. The number of ports may depend on the line rate per port and the performance of the security processor. The sum of the line rates of the incoming ports should be matched with the processing performance of the security processor to provide security inspection to substantially the entire incoming traffic. A conscious choice could be made to use a higher line rate sum than the processor's capability if it is known that not all the traffic needs to be inspected for security purposes. The decision of which traffic must be inspected may depend on the connection or the session as programmed in the processor from the central manager. The security processor of FIG. 72 may thus be used to provide flow-through security inspection to the traffic and may be used in a flow-through configuration like that illustrated by FIG. 68. A flow-through configuration may be created for various types of systems like a switch or a router line card or a host server adapter or a storage networking line card or adapter or the like. In a flow-through configuration the security processor is directly exposed to the traffic on the network. Thus, the central manager and the rules distribution engine may directly communicate with the control processor and scheduler, block 7203 or block 6809, of the security processor. The security processor of block 6802 is similar to the one illustrated in FIG. 72 without the runtime adaptable processor incorporated in it. 
One of the issues in a flow-through configuration that needs to be addressed is the latency introduced in the traffic by the security processor. Network switches or routers, for example, are very sensitive to the latency performance of the system. Hence in such a configuration deep packet inspection can add significant latency to the detriment of the system performance. Hence, the security processors for the flow-through configuration of this invention provide a cut-through logic, illustrated by block 6807, that is used to pass the data traffic from the input of the security processor to its output incurring a minimal latency to support the overall system performance needs. The control processor and scheduler block 7203 of FIG. 72 provides the cut-through logic and is not illustrated separately. In a flow-through configuration, once a match has been found the security processor may create special control packets internal to the system, where the system's switch processor or a network processor or other processors may interpret these messages and perform appropriate action on the packets that utilize the cut-through mode before those packets are allowed to exit the system. Such a protocol may be a proprietary protocol within a system or may utilize a standard protocol as may be appropriate for the system incorporating a flow-through security configuration.

FIG. 73 illustrates another version of the security processor which is very similar to that in FIGS. 71 and 72, with some additional functionality. The additional functionality is provided by the classification/rules engine, block 7313, the classification/rules database, block 7314, and the database extension controller, block 7315. These blocks are similar to those of FIGS. 20 and 30 described above. These blocks may be used to provide high performance network layer rules processing using a CAM based architecture. The ternary CAM based database may also be used to provide a fast match to specific fields in a network header to create hash keys for connection identification and connection state retrieval or update. The control processor and scheduler decides which parts of a packet to present to the classification/rules engine depending on the rules that are programmed in it versus those programmed in the content search and rule processing engines. A CAM based architecture typically consumes a lot of power and hence may be limited in its applications except when extremely high speeds may be required at extremely low latencies. The content search and rule processing engines may be able to provide this functionality at much lower power as well as perform the searches for a much larger rule set compared to that in a CAM based architecture. The database extension port, block 7315, may be used to extend the CAM database size using an external classification/rules engine.

FIG. 74 illustrates a content search and rules processing engine of blocks 7106, 7206, 7306. Each content search and rule processing engine comprises interface blocks to various memories that hold the rule state tables distributed by the rules distribution engine of FIG. 67. This engine comprises a content fetch block, 7405, which fetches the packet data to be analyzed from the packet buffer block 7104 or its equivalent from FIGS. 72 and 73. The character class look-up, block 7406, accept/state transition look-up, block 7407, next state look-up, block 7408, and match/action look-up, block 7409, each perform state data fetches from the appropriate state memories of FIGS. 71, 72 and 73. The content search state machine, block 7404, includes the state machine used to analyze the fetched character or characters with the data in the various tables. The state machine uses the fetched character to index into the character class table to retrieve a column address for the DFA state machine. In parallel the state machine fetches the state transition table data using the current state as an index to retrieve a row address for the DFA state machine. The current state may be initialized to the start state when beginning a new search; otherwise the next state that is retrieved becomes the current state for the next iteration of the state machine. The row address and the column address are then used to retrieve the next state for the state machine. The retrieved next state index is also used to fetch an action tag if this is a terminating state. An accept state look-up performed in parallel is used to identify if the retrieved state is a terminating state or an error state or a continuing or accepting state. The content search state machine, block 7404, effectively iterates through the steps outlined above until an error is found or a match is found or the packet is exhausted or a combination of these. 
For connection based look-up functionality, the current internal state like the address pointers, current state and next state and the like are provided to the control processor and scheduler block 7103 or 7203 or 7303 for it to maintain the state in the connection database. When a new packet for a given connection is scheduled, the stored internal state of the content search state machine is retrieved and provided to block 7404 to start processing the new packet for the connection as if it had been a continuous stream with previous packets for the connection. Security processors that include a runtime adaptable processor may also comprise a RAP command controller, block 7403, which is coupled to the adaptation controller block 7205 to adapt the runtime adaptable processor to provide the service as needed by the match and the action tag found with it. The action tag may also be provided to the control processor and scheduler for it to schedule the analyzed packet to the runtime adaptable processor. The adaptation controller may use the command(s) provided by block 7403 as a hint or command to get the processor ready with the needed avatar configuration information, if it is not already present as one of the avatars in the runtime adaptable processor, block 7102.
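
The table-driven iteration described above may be easier to follow with a small software model. The sketch below is illustrative only: the table contents are hand-built for the regular expression (xy+y)*yx used elsewhere in this disclosure, and the names CHAR_CLASS, ROW_ADDR, NEXT_STATE, ACCEPT_TABLE, ACTION_TAG and run, the state numbering and the action tag string are all assumptions, not the layout of the actual state memories.

```python
from typing import Optional, Tuple

# Character class look-up: input byte -> column address (0='x', 1='y', 2=other).
CHAR_CLASS = [2] * 256
CHAR_CLASS[ord("x")] = 0
CHAR_CLASS[ord("y")] = 1

# State transition look-up: current state -> row address into NEXT_STATE.
ROW_ADDR = [0, 3, 6, 9, 12, 15]

# Next state look-up, three columns per row. State 5 is the error state.
NEXT_STATE = [
    1, 2, 5,  # state 0 (start)
    5, 3, 5,  # state 1
    4, 2, 5,  # state 2
    1, 2, 5,  # state 3
    5, 3, 5,  # state 4 (terminating)
    5, 5, 5,  # state 5 (error)
]

CONTINUE, ACCEPT, ERROR = 0, 1, 2
ACCEPT_TABLE = [CONTINUE, CONTINUE, CONTINUE, CONTINUE, ACCEPT, ERROR]

# Match/action look-up: terminating state -> action tag (hypothetical tag).
ACTION_TAG = {4: "rule-1-matched"}


def run(data: bytes, state: int = 0) -> Tuple[int, Optional[str]]:
    """Iterate until an error, a match, or the end of the data.

    Passing a previously returned state resumes the search, as a saved
    connection state would.
    """
    for byte in data:
        col = CHAR_CLASS[byte]          # column address from the character
        row = ROW_ADDR[state]           # row address from the current state
        state = NEXT_STATE[row + col]   # retrieved next state
        kind = ACCEPT_TABLE[state]      # accept state look-up
        if kind == ERROR:
            return state, None
        if kind == ACCEPT:
            return state, ACTION_TAG[state]
    return state, None
```

In this model, running the input "xy" leaves the machine in a continuing state, and passing that returned state back in with a second input "yx" reaches the terminating state, which mirrors how the stored state lets a new packet continue the search of its connection.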

As described earlier, the security processor of this invention may be embedded in systems with many different configurations, dependent on the system environment, system functionality, system design or other considerations. This patent illustrates three such configurations in FIGS. 68, 69 and 70. As discussed above, FIG. 68 illustrates the security processor in a network line card or an adapter providing flow-through security. In this configuration the security processor may reside next to the media interface as illustrated, or after block 6803, closer to the host or back plane interface block 6804. Such decisions are system design decisions and are not precluded by the use of the security processor of this patent. In a scenario where the security processor incorporates TCP/IP or protocol processing capability, the block 6803 may not be required in some systems. FIG. 69 illustrates a look-aside security configuration for a network line card or an adapter. In such a configuration, there exists a master processor which may be a switch processor, network processor, forwarding engine, classification engine or other processor illustrated by block 6903. The master processor communicates with the central manager of FIG. 66 as described earlier to receive the rules and to provide events back to the central manager, working with the security processor. The master processor may also incorporate functions illustrated by blocks 6902 and 6904. The master processor could also be a TCP/IP processor or one of the other IP processor variations that are feasible from the processors of this patent.

FIG. 70 illustrates a security and content search acceleration adapter. Such an adapter may be inserted as an accelerator card in a multitude of networked systems discussed above like a server, a router, a switch and the like. The security processor on this accelerator card may be coupled to the host bus or back plane directly or through a bridge device like that illustrated by block 7003. The security processor communicates with the host processor or a master processor of the system to receive the packets or content to be inspected and provides the results back. A driver on the host or master processor may perform this communication with the security processor. Such a driver or other software running on the host or the master processor may communicate with the central manager to receive the rules database, or updates to it, or provide match results to the central manager based on the actions programmed. The accelerator card may include other devices like a ternary CAM based search engine that may be used to perform network layer security functions or connection ID detection or hash key generation or other functions or a combination thereof, which may assist in performing the network layer and application layer security acceleration functions discussed above.

The security processor of FIG. 71, 72 or 73 may also be used to perform content searches on documents or digital information and be used to create indexes that may be used for accelerated searches like the web search capability provided by Google or their competitors. Using the security processor of this invention for such a task can provide significant performance improvement to indexing and searches compared to that done using general purpose processor based software. For such an application the control processor and scheduler of the security processor may utilize the content search and rules processing engines to perform key phrase searches in data presented to it and get the match indexes. These results can then be used to create a master search index by a process that may run on the control processor and scheduler or another processor of the system that is servicing the content search request from end users. This master index may then be referred to in order to provide quick and comprehensive search results.

The security processor of FIG. 71, 72 or 73 described above may be coupled with elements of the processor of FIG. 16, 17, 61 or 62 to provide security capabilities to different versions of protocol processing architectures of this patent. For example, block 6615 illustrates one such variation where the TCP/IP protocol processor is coupled with the processor of FIG. 71, 72, or 73 to create another security processor with TCP/IP processing. Similar versions may be created by including IP storage protocol processing capability with the security processor or coupling TCP/IP processor with RDMA capability with the security processor of FIG. 71, 72 or 73 or a combination thereof. The security processor of FIG. 71, 72 or 73 may also be used in place of the classification engine, block 1703, shown in more detail in FIGS. 20 and 30 as described above, when the security processor is programmed to search for the classification fields used in block 1703.

FIG. 75 illustrates an example of regular expression rules. As discussed earlier, REs have been used since the mid 1950s and are used by many popular applications. The figure illustrates a small set of regular expression rules, also called a rule tree in this discussion, which can be used to analyze and parse some of the constructs used in HTTP. RE rules like those illustrated, with a full set of rules for the language or protocol or the like, can be provided to tools like lex, flex and the like, which can be used to create lexical analyzers which would then be used to analyze content against the specific language or the protocol. The rules are labeled R1 through R9 in this figure for ease of discussion and don't represent the syntax for a specific tool. The rules illustrate four rule groups: <COMMENT>, <FNT>, <S_TAG> and <INITIAL>. Rules R1 through R3 belong to the <COMMENT> rule group and similarly other rules belong to other rule groups as illustrated. Each rule comprises a regular expression and one or more actions/tags which can be processed if the regular expression rule is triggered or matched. For example, R9 is part of the <INITIAL> group and searches for an angled bracket (‘<’) in the content. If the angled bracket is found in the content, then the action/tag is processed. In this rule the action comprises starting the S_TAG rule group. The assumption in this illustration is that a rule group labeled <INITIAL> is activated first when the content search starts and then the actions/tags drive the flow to activate other rule groups based on the content. In this illustration, when the first angled bracket is detected the rules that are part of the S_TAG group are activated and the INITIAL group is put on hold. As different content gets processed, the rules flow through the rule tree with the <INITIAL> rule group as the root of the rule tree where the processing of the content search begins. This is an example of the prior art of creating content search rules.
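
The flow between rule groups can be sketched as a small scanner in which only one rule group is active at a time. This is a rough illustration, not the rule set of FIG. 75: only the rule that detects ‘<’ in the INITIAL group and starts the S_TAG group follows the text above, while every other pattern, every tag name, and the names RULE_GROUPS and scan are hypothetical.

```python
import re
from typing import Dict, List, Optional, Tuple

# Each rule: (regular expression, tag to emit or None, rule group to start or None).
Rule = Tuple[re.Pattern, Optional[str], Optional[str]]

RULE_GROUPS: Dict[str, List[Rule]] = {
    "INITIAL": [
        (re.compile(r"<"), "start_tag", "S_TAG"),   # modeled on R9 in the text
        (re.compile(r"[^<]+"), "text", None),       # hypothetical rule
    ],
    "S_TAG": [
        (re.compile(r"[A-Za-z][A-Za-z0-9]*"), "tag_name", None),  # hypothetical
        (re.compile(r">"), "end_tag", "INITIAL"),                 # hypothetical
    ],
}


def scan(text: str) -> List[Tuple[str, str]]:
    """Apply the active rule group at each position, starting from INITIAL."""
    group, pos = "INITIAL", 0
    out: List[Tuple[str, str]] = []
    while pos < len(text):
        for pattern, tag, next_group in RULE_GROUPS[group]:
            m = pattern.match(text, pos)
            if m:
                if tag:
                    out.append((tag, m.group()))
                if next_group:
                    group = next_group  # the action activates another group
                pos = m.end()
                break
        else:
            pos += 1  # no rule in the active group matched; skip one character
    return out
```

Scanning the input "hi<b>" with these assumed rules yields a text token, then the start_tag action switches to the S_TAG group, where the tag name and the closing bracket are recognized before control returns to INITIAL.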

As described earlier, a regular expression can be represented using an FSA like an NFA or a DFA. FIG. 76 a illustrates Thompson's construction for the regular expression (xy+y)*yx. Thompson's construction proceeds in a step by step manner where each step introduces two new states, so the resulting NFA has at most twice as many states as the symbols and operators in the regular expression. An FSA is comprised of states, state transitions, and symbols that cause the FSA to transition from one state to another. An FSA comprises at least one start state and at least one accept state, where the start state is where the FSA evaluation begins and the accept state is a state which is reached when the FSA recognizes a string. Block 7601 represents the start state of the FSA, while block 7605 is an accept state. Block 7602 represents state 2 and 7604 represents state 3. The transition from state 2 to state 3 is triggered on the symbol x, 7603, and is represented as a directed edge between the two states. Thompson's NFA comprises ‘ε’ transitions, 7616, which are transitions among states which may be taken without any input symbol.

FIG. 76 b illustrates the Berry-Sethi NFA for the regular expression (xy+y)*yx. Berry and Sethi described an algorithm for converting regular expressions into FSA using a technique called ‘marking’ of a regular expression. It results in an NFA which has the characteristic that all transitions into any state are on the same symbol. For example, all transitions into state 1, 7607, are on symbol ‘x’. The other characteristic of the Berry-Sethi construct is that the number of NFA states is the same as the number of symbols in the regular expression plus one start state. In this type of construction, each occurrence of a symbol is treated as a new symbol. The construction converts the regular expression (xy+y)*yx to a marked expression (x₁y₂+y₃)*y₄x₅ where each transition on x₁ leads to the same state, 7607. The figure does not illustrate the markings. Once the FSA is constructed the markings are removed. FIG. 76 b illustrates the NFA with the markings removed. As can be seen from the figure, in the Berry-Sethi construction all incoming transitions into a state are dependent on the same symbol. Similarly, a dual of the Berry-Sethi construct has also been studied and documented in the literature as discussed earlier, where instead of all incoming transitions being dependent on the same symbol, all outgoing transitions from a state are dependent on the same symbol. The Berry-Sethi construct is also called a left-biased type of construct, whereas its dual is called a right-biased construct.
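
A software model of such a left-biased NFA for (xy+y)*yx is sketched below. The transition sets were derived by hand from the marked expression (x₁y₂+y₃)*y₄x₅ for this illustration, with state 0 as the start state and state k entered on the k-th marked symbol; the state numbering need not match FIG. 76 b, and the names SYMBOL, FOLLOW, step and accepts are hypothetical.

```python
from typing import Dict, Set

# The symbol on which each state is entered; every transition into a given
# state is on this one symbol (the left-biased property).
SYMBOL: Dict[int, str] = {1: "x", 2: "y", 3: "y", 4: "y", 5: "x"}

# FOLLOW[p] lists the states that can be entered directly after state p.
FOLLOW: Dict[int, Set[int]] = {
    0: {1, 3, 4},  # start: the expression may begin with x1, y3 or y4
    1: {2},        # x1 must be followed by y2
    2: {1, 3, 4},  # after y2: loop again or leave the star via y4
    3: {1, 3, 4},  # after y3: loop again or leave the star via y4
    4: {5},        # y4 must be followed by x5
    5: set(),      # x5 ends the expression
}
ACCEPT: Set[int] = {5}


def step(active: Set[int], ch: str) -> Set[int]:
    """Advance all active states in parallel on one input symbol."""
    return {q for p in active for q in FOLLOW[p] if SYMBOL[q] == ch}


def accepts(text: str) -> bool:
    """Return True when the whole input is recognized by the NFA."""
    active = {0}
    for ch in text:
        active = step(active, ch)
    return bool(active & ACCEPT)
```

The model has six states, five for the marked symbols plus one start state, which agrees with the state count property stated above. Several states can be active at once, for example states 3 and 4 after a leading ‘y’.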

FIG. 76 c illustrates a DFA for the same regular expression (xy+y)*yx. A DFA is deterministic in that only one of its states is active at a given time, and only one transition is taken dependent on the input symbol. In an NFA, by contrast, multiple states can be active at the same time and transitions can be taken from one state to multiple states based on one input symbol. There are well known algorithms in the literature, like subset construction, to convert an RE or NFA to a DFA. One point to note for the DFA that is illustrated for the regular expression is that it has fewer states than both the Thompson NFA and the Berry-Sethi NFA. The upper bound on the number of states for an N character DFA is 2^(N); however, expressions that result in the upper bound in the number of DFA states do not occur frequently in lexical analysis applications, as noted by Aho, Sethi and Ullman in section 3.7 of their book on Compilers referenced above.
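
Subset construction can be illustrated on the same expression. The sketch below starts from a hand-derived six state left-biased NFA for (xy+y)*yx (an assumption of this example, with state 0 as the start state and state 5 as the accept state) and builds a DFA whose states are sets of NFA states. The names subset_construction and dfa_accepts are hypothetical, and the resulting DFA is not minimized, so it need not match FIG. 76 c state for state.

```python
from typing import Dict, FrozenSet, Set, Tuple

# Hand-derived left-biased NFA for (xy+y)*yx (assumed for this example).
SYMBOL: Dict[int, str] = {1: "x", 2: "y", 3: "y", 4: "y", 5: "x"}
FOLLOW: Dict[int, Set[int]] = {
    0: {1, 3, 4}, 1: {2}, 2: {1, 3, 4}, 3: {1, 3, 4}, 4: {5}, 5: set(),
}
NFA_ACCEPT = 5

DfaState = FrozenSet[int]
Dfa = Dict[DfaState, Dict[str, DfaState]]


def subset_construction() -> Tuple[DfaState, Dfa]:
    """Build a DFA whose states are the reachable sets of NFA states."""
    start: DfaState = frozenset({0})
    dfa: Dfa = {}
    todo = [start]
    while todo:
        cur = todo.pop()
        if cur in dfa:
            continue
        dfa[cur] = {}
        for ch in "xy":
            nxt = frozenset(q for p in cur for q in FOLLOW[p] if SYMBOL[q] == ch)
            if nxt:  # a missing entry means the input is not recognized
                dfa[cur][ch] = nxt
                todo.append(nxt)
    return start, dfa


def dfa_accepts(text: str) -> bool:
    """Follow exactly one transition per input symbol."""
    cur, dfa = subset_construction()
    for ch in text:
        if ch not in dfa[cur]:
            return False
        cur = dfa[cur][ch]
    return NFA_ACCEPT in cur
```

For this expression the construction reaches five subsets, fewer than the six states of the NFA it started from and far below the exponential upper bound.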

FIG. 77 a illustrates a left-biased NFA and its state transition table (prior art). The illustration is a generic four state Berry-Sethi like NFA with all transitions from each node to the other shown with the appropriate symbol that the transition depends on. For example, state A, 7701, has all incoming transitions dependent on symbol ‘a’ as illustrated by example transitions labeled 7702 and 7703. When the FSA is in state A, 7701, an input symbol ‘d’ transitions the FSA to state D with the transition, 7704, from state A to state D. The table in the figure illustrates the same FSA using a state transition table. The column ‘PS’, 7711, is the present state of the FSA, while the row ‘sym’, 7712, is a list of all the symbols that the state transitions depend on. The table, 7713, illustrates the next state (NS) that the FSA will transition to from the present state (PS) when an input symbol from those in the sym header row is received. In this FSA, state ‘A’ is the start state and state ‘C’ is an accept state. Hence, if the FSA is in the present state ‘A’ and an input symbol ‘b’ is received, the FSA transitions to the next state ‘B’. So when the next input symbol is received, the FSA is in present state ‘B’ and is evaluated for state transition with the row corresponding to present state ‘B’.

FIG. 77 b illustrates a right-biased NFA and its state transition table (prior art). The illustration is a generic four state dual of the Berry-Sethi NFA with all transitions from each node to the other shown with the appropriate symbol that the transition depends on. For example, state ‘A’, 7705, has all outgoing transitions dependent on symbol ‘a’ as illustrated by example transitions labeled 7708 and 7709, whereas, unlike the left-biased NFA described above, the incoming transitions are not all on the same symbol; for example, transitions labeled 7706 and 7707 depend on symbols ‘b’ and ‘d’ respectively. The state transition table in this figure is similar to the left-biased one, except that the FSA transitions to multiple states based on the same input symbol. For example, if the FSA is in the present state ‘B’ and a symbol ‘b’ is received, then the FSA transitions to all states ‘A’, ‘B’, ‘C’ and ‘D’. When an input symbol is received which points the FSA to an empty box, like 7716, the FSA has received a string which it does not recognize. The FSA can then be initialized to start from the start state again to evaluate the next string and may indicate that the string is not recognized.

FIG. 77 a and FIG. 77 b illustrate generic four state NFAs where all the transitions from each state to the other are shown based on the left-biased or right-biased construct characteristics. However, not all four state NFAs would need all the transitions to be present. Thus if a symbol is received which would require the FSA to transition from the present state to the next state when such a transition on the received input symbol is not present, the NFA is said to not recognize the input string. At such time the NFA may be restarted in the start state to recognize the next string. In general, one can use these example four state NFAs to represent any four state RE in a left-biased (LB) or right-biased (RB) form provided there is a mechanism to enable or disable a given transition based on the resulting four state NFA for the RE.

FIG. 78 a illustrates state transition controls for a left-biased and right-biased NFA. The figure illustrates a left-biased NFA with a state ‘A’, 7800, which has incoming transitions dependent on receiving input symbol ‘S1’ from states ‘B’, 7801, ‘C’, 7802, and ‘D’, 7803. However, the transitions from each of the states ‘B’, ‘C’ and ‘D’ to state ‘A’ occur only if the appropriate state dependent control is set, besides receiving the input symbol ‘S1’. The state dependent control for the transition from state ‘B’ to state ‘A’ is V₂ while those from states ‘C’ and ‘D’ to state ‘A’ are V₃ and V₄ respectively. Transition to the next state ‘A’ is dependent on present state ‘A’ through the state dependent control V₁. Thus transition into state ‘A’ occurs depending on the received input symbol being ‘S1’ and on the state dependent control for the appropriate transition being set. Thus, one can represent any arbitrary four state NFA by setting or clearing the state dependent control for a specific transition. Thus, if a four state left-biased NFA comprises transitions into state ‘A’ from states ‘B’ and ‘C’ but not from states ‘A’ or ‘D’, the state dependent controls can be set as V₁=0, V₂=1, V₃=1 and V₄=0. Hence if the NFA is in state ‘D’ and a symbol ‘S1’ is received, the NFA will not transition into state ‘A’; however, if the NFA is in state ‘B’ and a symbol ‘S1’ is received, the NFA will transition into state ‘A’.

Similarly, FIG. 78 a also illustrates states and transitions for a right-biased NFA. The figure illustrates a right-biased NFA with a state ‘A’, 7806, which has incoming transitions from state ‘B’, 7807, state ‘C’, 7808, and state ‘D’, 7809, on receiving input symbols ‘S2’, ‘S3’ and ‘S4’ respectively. However, the transitions from each of the states ‘B’, ‘C’ and ‘D’ to state ‘A’ occur only if the appropriate state dependent control is set, besides receiving the appropriate input symbol. The state dependent control for the transition from state ‘B’ to state ‘A’ is V₂ while those from states ‘C’ and ‘D’ to state ‘A’ are V₃ and V₄ respectively. Transition to the next state ‘A’ is dependent on present state ‘A’ through the state dependent control V₁. Thus transition into state ‘A’ occurs based on the received input symbol and on the state dependent control for the appropriate transition being set. Thus, one can represent any arbitrary four state right-biased NFA by setting or clearing the state dependent control for a specific transition. All state transition controls for a given state form a state dependent vector (SDV), which is comprised of V₁, V₂, V₃, and V₄ for the example in FIG. 78 a for the left-biased and the right-biased NFAs.

FIG. 78 b illustrates a configurable next state table per state. The left-biased state table for ‘NS=A’ is shown by the table 7811, whereas the right-biased state table for ‘NS=A’ is shown by the table 7812. The state dependent vector for both the left-biased and right-biased NFA state is the same, while the received input symbols that drive the transition are different for the left-biased vs. right-biased NFA states. Thus a state can be represented with properties like left-biased (LB), right-biased (RB), start state, accept state, SDV as well as the action that may be taken if this state is reached during the evaluation of input strings to the NFA that comprises this state.

FIG. 79 a illustrates state transition logic (STL) for a state. The STL is used to evaluate the next state for a state. The next state computed using the STL for a state depends on the current state of the NFA, the SDV, and the received symbol or symbols for a left-biased NFA and right-biased NFA respectively. The InChar input is evaluated against symbols ‘S1’ through ‘Sn’ using the symbol detection logic, block 7900, where ‘n’ is an integer representing the number of symbols in the RE of the NFA. The choice of ‘n’ depends on how many states are typically expected for the NFAs of the applications that may use the search processor. Thus, ‘n’ may be chosen to be 8, 16, 32 or any other integer. The simplest operation for symbol detection may be a compare of the input symbol with ‘S1’ through ‘Sn’. The output of the symbol detection logic is called the received symbol vector (RSV), comprised of individual detection signals ‘RS1’ through ‘RSn’. LB/RB# is a signal (i.e. a left-biased or right-biased signal) that indicates whether a left-biased NFA or a right-biased NFA is defined. LB/RB# is also used as an input in evaluating the state transition. The STL for a state supports creation of left-biased as well as right-biased NFA constructs. The LB/RB# signal controls whether the STL is realizing a left-biased or a right-biased construct. The state dependent vector, in the form of ‘V1’ through ‘Vn’, is also applied as input to the STL. The SDV enables creation of arbitrary ‘n’-state NFAs using the STL as a basis for a state logic block illustrated in FIG. 79 b. Present states are fed into the STL as a current state vector (CSV) comprised of ‘Q1’ through ‘Qn’. The STL generates a signal ‘N1’ which gets updated in the state memory, block 7902, on the next input clock signal. ‘N1’ is logically represented as N1 = ((V1 AND Q1 AND (LB/RB# OR RS1)) OR (V2 AND Q2 AND (LB/RB# OR RS2)) OR . . . OR (Vn AND Qn AND (LB/RB# OR RSn))) AND ((NOT LB/RB#) OR RS1). 
A similar signal for another state ‘n’ would be generated with similar logic, except that the signal 7901, feeding into the OR gate, 7915, would be ‘RSn’, which is the output of the ‘n’-th symbol detection logic, changing the last term of the node ‘N1’ logic from ((NOT LB/RB#) OR RS1) to ((NOT LB/RB#) OR RSn). The state memory, 7902, can be implemented as a single bit flip-flop in the state logic block discussed below.
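
The next state logic above can be checked with a short software model. The sketch below evaluates the stated expression for one state, reading LB/RB# as true for a left-biased construct; the function name stl_next and the use of 0-based list indices are assumptions of this illustration, and the model ignores clocking and the Init logic.

```python
from typing import Sequence


def stl_next(k: int, lb: bool, sdv: Sequence[int],
             csv: Sequence[int], rsv: Sequence[int]) -> bool:
    """Next value of state k (0-based) from the SDV, CSV and RSV.

    lb  : LB/RB# signal, True for left-biased, False for right-biased
    sdv : state dependent vector V1..Vn for state k
    csv : current state vector Q1..Qn
    rsv : received symbol vector RS1..RSn
    """
    # OR over all present states: (Vi AND Qi AND (LB/RB# OR RSi))
    source_term = any(v and q and (lb or rs) for v, q, rs in zip(sdv, csv, rsv))
    # Last term: ((NOT LB/RB#) OR RSk)
    target_term = (not lb) or bool(rsv[k])
    return bool(source_term and target_term)
```

With the left-biased example given earlier for FIG. 78 a (V₁=0, V₂=1, V₃=1, V₄=0 for state ‘A’), the model leaves state ‘A’ inactive when the NFA is in state ‘D’ and ‘S1’ is received, and activates it when the NFA is in state ‘B’ and ‘S1’ is received. In the right-biased case the same SDV is used, but the transition from ‘B’ fires on the symbol detected for ‘B’ rather than on ‘S1’.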

FIG. 79 b illustrates a state logic block (SLB). The SLB comprises the STL, 7906, Init logic, 7908, state memory, 7910, the accept state detect logic, 7911, the SDV for this state, 7907, start flag, 7909, accept flag, 7912, tag associated with this state, 7919, or action associated with this state, 7913, or a combination of the foregoing. The SLB provides programmable storage for each of the following: SDV, start flag, accept flag, tag field, action field, left-biased or right-biased signal, and the state memory, or a combination thereof, which may be programmed using the application state controller, block 8000, working with the cluster configuration controller, 8607, and global configuration controller, 8505, discussed below, to configure a state of the realized FSA in the SLB at runtime. The SLB receives the current state vector and the received symbol vector, which are fed to the STL to determine the next state. The realization of a state of an arbitrary NFA can then be done by updating the SDV for the state and selecting the symbols that the NFA detects and takes actions on. Further, each state may get marked as a start state or an accept state or a combination or neither through the start and accept flags. The init logic block, 7908, receives control signals that indicate if the state needs to be initialized from the start state or cleared or disabled from updates, or loaded directly with another state value, or may detect a counter value, discussed below, and decide whether to accept a transition or not, and the like. The Init block can be used to override the STL evaluation and set the state memory to the active or inactive state. The STL, 7906, provides functionality as illustrated in FIG. 79 a, except that the state memory is included in the SLB as an independent functional block, 7910. The state memory, 7910, can be implemented as a single bit flip-flop. When the state memory is set it indicates that the state is active; otherwise the state is inactive. 
The accept detect logic, 7911, detects if this state has been activated and if it is an accept state of the realized NFA. If the state is an accept state, and if this state is reached during the NFA evaluation, then the associated action or the tag is provided as an output of the SLB on the A1 signal, 7916, and an accept state activation is indicated on M1, 7917. The action or the tag could be set up to be output from the SLB on the state activation even when the state is not an accept state, as required by the implementation of the NFA. This can enable the SLB to be used for a tagged NFA implementation where an action or tag can be associated with a given transition into a state.

If there are ‘n’ states supported in the search engine, discussed later, each SLB needs an ‘n’-bit SDV, which can be stored as a register comprised of flip-flops, 2 bits allocated to the start and accept flags, 1 bit for LB/RB#, m-bit action storage and t-bit tag storage. Thus if n=16, m=6 and t=4, then the total storage used per SLB would be a 29-bit register equivalent, which is a little less than 4 bytes per state.
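
The 29-bit figure follows directly from the field widths listed above, as the short worked sum below shows. The symbol names are chosen for this illustration only.

```latex
\begin{align*}
\text{bits per SLB} &= \underbrace{n}_{\text{SDV}}
  + \underbrace{2}_{\text{start, accept flags}}
  + \underbrace{1}_{\text{LB/RB\#}}
  + \underbrace{m}_{\text{action}}
  + \underbrace{t}_{\text{tag}} \\
  &= 16 + 2 + 1 + 6 + 4 = 29 \text{ bits}
   = 3.625 \text{ bytes} < 4 \text{ bytes}
\end{align*}
```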

FIG. 80 illustrates an NFA based search engine. The search engine may comprise ‘n’ SLBs, SLB1 through SLB n, block 8002, Application state control, block 8000, Application state memory, block 8001, Counters, block 8004, Match logic and accept state detect, block 8005, FSA extension logic, block 8006, Symbol detection array, block 8007, or Pattern substitution control, block 8008, or a combination of the foregoing to generate a runtime adaptable NFA search engine. The number of states included per search engine may be any integer number ‘n’. For example, ‘n’ may be 8, 16 or 32 or any other integer. A regular expression with up to ‘n’ symbols may be fully supported by the search engine. If the number of states resulting from the RE is higher than that supported by a search engine, then it is possible to connect two or more search engines together using the FSA extension logic, block 8006, to handle a larger RE. If the number of states of an RE is smaller than ‘n’, the search engine allows multiple NFAs per search engine. This can result in efficient utilization of the search engine's resources, unlike solutions that do not allow multiple REs per search engine. There may be ‘M’ rules supported per search engine, where M can be any integer less than ‘n’. For instance, if ‘M’ is 4 and ‘n’ is 16, it is possible to configure four RE rules in the search engine for simultaneous evaluation. Many more rules may be stored in the application state memory, though ‘M’ is the number that may be simultaneously configured in the SLBs for evaluation. The total number of rules and application contexts stored in the application state memory is a function of the amount of memory allocated per search engine and is not limited to ‘n’ or ‘M’. The SLBs are the state logic blocks illustrated in FIG. 79 b. The number of SLBs included in the search engine is the same as the number of states supported by the search engine.

The search engine receives as inputs the FSA control and data buses, ‘FSACtrl’ and ‘FSAdata’ respectively, the input symbol on the bus ‘InChar’ and the substitution control and data on the bus ‘SubBus’. Other standard inputs like power, clock, reset and the like, which may be required, are not illustrated in this figure or other figures in this disclosure so as not to complicate the illustrations. The use of such signals is amply documented in basic hardware design textbooks and is well understood by anyone with ordinary skill in the art. Though the description is with respect to the inputs as illustrated, the information may be received on multiplexed busses or other parallel busses, and hence this description is not to be taken as a limitation of the invention. These signals or busses are illustrated for ease of discussion. The search engine also comprises an application state memory which holds application specific NFA rule context for multiple applications like HTTP, SMTP, XML and the like. The application state memory holds information like the SDV, initial state vector (ISV), accept state vector (ASV), LB/RB#, Actions/Tags, number of rules and the like for each application NFA. The application state control, block 8000, is used by the search processor's global configuration controller, described later, to configure the appropriate NFA parameters in the application state memory for each of the supported applications. The configuration controller can then instruct the application state control to program the SLBs using a particular application NFA for performing search for that application. There may be multiple REs or rules per application in the search engine if the number of states for the REs together is less than that supported by the search engine. The application state control can retrieve the appropriate application context from the application state memory and program each SLB with the appropriate information to implement the NFA programmed for the application. 
The parameters being configured in each SLB comprise the SDV, start state, accept state, LB/RB#, action or tag. The application state controller also configures the symbol detection array, block 8007, using the symbols for the NFA retrieved from the application state memory. Appropriate counter initialization may also be done by the application state control for the NFA before enabling the search engine to start processing input symbols received on ‘InChar’. The state outputs of the SLBs are concatenated together to form the CSV. The RSV is the output from the symbol detection array. The RSV and CSV are fed into each SLB. The SLBs then perform the state transition evaluation based on their configuration until an accept state is reached or the NFA reaches a null state, which indicates that the input string is recognized by the NFA or rejected by the NFA respectively. The symbol detection array, 8007, comprises an array of computation engines, which in the simplest form can be comparators, and memory for the symbols being searched. The input symbol received on ‘InChar’ is compared simultaneously with all the symbols, and the appropriate signals of the RSV are set if the symbol being searched matches the received symbol. The symbol detection array may also interact with the counters block, 8004, to enable rules that count occurrences of symbols before taking a specific transition. The accept state and the action fields from each SLB are used by the match and accept state detection logic, block 8005, to detect if an NFA has reached an accept state; it then registers the action for the NFA to provide to the global configuration controller discussed later. The action or tag may provide an indication that an RE rule has matched, but may also provide an action that may be taken on the input stream. 
For example, the action may indicate that the input stream contains a computer virus being searched for and hence the input stream should be quarantined, or it may detect that confidential information being searched for is part of the stream and hence the transmission of this stream should be stopped, and the like. The action or tag may also comprise a pointer to a code routine that the global controller or the control processor, discussed later, may perform. The code routine may cause the configuration of the search engine to change to search for a different set of rules. For example, if the NFA was part of a rule set <INITIAL> discussed earlier and if it triggers, then the action to be taken may be to activate the <S_TAG> set of rules and put the <INITIAL> set on hold, for instance. There are innumerable types of tasks that may be accomplished using the invoked code routine as a result of the RE rule being matched, as would be obvious to one skilled in the art. A complex set of actions can be taken on the input stream, or other control functions or management code may be activated, or the like, once an NFA reaches an accept state if the associated action or tag is programmed for such an action to occur. The indication of a match as well as the rule ID being matched may be output on the M-bus, 8009, by the match logic, 8005, for the cluster controller, discussed later, or the global configuration controller to take appropriate action. The action/tag, 8010, carries the associated action or tag to the cluster controller or the global configuration controller, which may follow actions as described above. The match logic is set up to support ‘M’ simultaneous rule evaluations by the search engine.
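The evaluation flow described above (symbol detection array producing the RSV, SLB state outputs concatenated into the CSV, and match logic reporting tags) can be sketched in software. This is a minimal model under stated assumptions, not the STL of FIG. 79 a: it assumes a state becomes active when its own symbol is detected and either its start flag is set or some state named in its SDV is currently active. The names `SLB`, `step`, `search` and `RULE_ABC` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SLB:
    symbol: str           # symbol this state detects (one comparator of the array)
    sdv: int              # bitmask of states this state depends on
    start: bool = False   # start flag
    accept: bool = False  # accept flag
    tag: str = ""         # tag or action reported on a match


def step(slbs, csv, ch):
    """Compute the next CSV from the current CSV and one input symbol."""
    # Symbol detection array: compare the input with every stored symbol.
    rsv = 0
    for i, slb in enumerate(slbs):
        if slb.symbol == ch:
            rsv |= 1 << i
    # Each SLB evaluates its transition from the shared CSV and RSV.
    nxt = 0
    for i, slb in enumerate(slbs):
        enabled = slb.start or (csv & slb.sdv) != 0  # assumed transition rule
        if enabled and (rsv >> i) & 1:
            nxt |= 1 << i
    return nxt


def search(slbs, stream):
    """Return (position, tag) for every activation of an accept state."""
    csv, hits = 0, []
    for pos, ch in enumerate(stream):
        csv = step(slbs, csv, ch)
        for i, slb in enumerate(slbs):
            if slb.accept and (csv >> i) & 1:
                hits.append((pos, slb.tag))
    return hits


# Hypothetical three-state rule recognizing the string "abc".
RULE_ABC = [
    SLB("a", sdv=0b000, start=True),
    SLB("b", sdv=0b001),
    SLB("c", sdv=0b010, accept=True, tag="R1"),
]
print(search(RULE_ABC, "xxabcab"))
```

Because every SLB sees the same CSV and RSV, all states are evaluated in the same step, which is the property the hardware exploits.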

There are instances of regular expressions where a certain number of events need to be counted before a state transition occurs. For example, an RE rule may be created like [3,5]abc, which indicates that the RE recognizes a string that has 3 to 5 leading ‘a’s followed by ‘b’ followed by ‘c’. If the input string is ‘aabc’ or ‘abc’ or ‘aaaaaabc’, then it would not be recognized by this RE, whereas an input string ‘aaaabc’ or ‘aaabc’ or ‘aaaaabc’ would be recognized by this RE. To enable such REs, counters are provided in the search engine as part of the counters block, 8004. There may be ‘L’ sets of counters to support multiple rule execution simultaneously. If some rules do not need to use the counters, then those counters may be allocated for use by other rules. The choice of ‘L’ may be a function of the search engine resource availability and the need for counters in the applications being supported using the search engine. The counter values may be read or written by the application state controller on the configuration of an application state into the SLBs if the NFA configured in the search engine needs counters. The output from the counters block is provided to the SLBs for them to make decisions on the counter values as necessary. The counters may also be read and written by the symbol detection array for it to support RE rules that depend on counting symbols occurring a certain number of times, like the one described above. The counter block also holds counters to keep track of the string match beginning and end for each of the ‘M’ rules. Begin and end counts indicate to the cluster controller or the control processor the start marker for the string match in the input stream and the end marker in the input stream respectively.
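The counted rule [3,5]abc discussed above can be illustrated with a small software model in which a counter gates the transition out of the counting state. This is a sketch only: the function name and the state labels are assumptions, and it checks a whole string rather than searching a stream.

```python
def matches_counted_rule(text: str, lo: int = 3, hi: int = 5) -> bool:
    """Accept lo..hi leading 'a' symbols followed by 'b' and then 'c'."""
    count = 0          # models one counter from the counters block
    state = "COUNT_A"  # hypothetical state labels
    for ch in text:
        if state == "COUNT_A":
            if ch == "a":
                count += 1
                if count > hi:
                    return False  # too many occurrences
            elif ch == "b" and lo <= count <= hi:
                state = "WANT_C"  # counter value allows the transition
            else:
                return False
        elif state == "WANT_C":
            if ch != "c":
                return False
            state = "ACCEPT"
        else:
            return False  # input continues past the accept state
    return state == "ACCEPT"


for s in ["aabc", "abc", "aaaaaabc", "aaaabc", "aaabc", "aaaaabc"]:
    print(s, matches_counted_rule(s))
```

The first three strings are rejected and the last three are accepted, matching the examples given in the text.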

The FSA extension logic, block 8006, is used to extend the number of NFA states that can be supported for an RE. Thus if there are REs that need more states than the ‘n’ supported by an individual search engine, it is possible to connect multiple search engines together into groups to support the RE. The FSA extension logic block provides the input and output signals that enable a search engine to know the state of the other search engines of the group. The signals may be fed into the SLBs as control signals, though they are not explicitly illustrated in FIG. 79 a, FIG. 79 b or FIG. 80.

The search engine may also comprise a pattern substitution control, 8008, which may hold pattern(s) to be used for substituting the input string as the search engine progresses through the NFA. This can support finite state transducer functionality, which is well understood and documented in the literature. This block may evaluate the transitions taken through the NFA by keeping track of two consecutive CSVs and making decisions about the transitions enabled and the associated transducer symbols to be generated for substituting the received input string.

A search engine with ‘n=16’ states would require 16×4=64 bytes of storage for the SLBs, plus 16 bytes for 16 symbols, plus counters for the ‘M’ rules; if M=4 and each counter is 16 bits, the counters take 8 bytes. Thus the total storage required for an application state context, discussed below, is 64+16+8=88 bytes, or around 90 bytes, for n=16.

FIG. 81 illustrates an example Application State Memory configuration. The application state memory, 8100, is the same as block 8001 in FIG. 80. FIG. 81 illustrates how that memory may be organized to store application state context for various applications. An application state context comprises information like the SDV, Start State, Accept State, Symbols for the NFA evaluation, actions and tags for the states and the like. This is the information that the application state control can retrieve from the application state memory to configure the SLBs, Counters, Symbol detection array and the match detection and accept detection logic blocks of the search engine. It may also include an indication of whether this particular RE needs grouping with another search engine block to support a larger RE than that supported by the number of states of the search engine. Thus the context carries all the state information that is necessary to evaluate an NFA. The number of application state contexts ‘N’ that can be stored in the application state memory is purely a function of the amount of application state memory allocated per search engine and the storage required per context. When ‘n=16’, if the amount of storage allocated per search engine is 512 bytes, then 5 application contexts can be stored in the application state memory. 1 KB of memory can allow about 10 application contexts to be stored. However, if the number of states per search engine is 8, the number of contexts can be up to two times that in the case of 16 states.
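The context sizing above can be worked through numerically. The sketch below assumes 4 bytes per SLB, 1 byte per symbol and 16-bit counters, as in the text; the function names are introduced here for illustration only.

```python
def context_bytes(n: int, rules: int, counter_bits: int = 16) -> int:
    """Approximate bytes per application state context for an n-state engine."""
    slb_bytes = n * 4                           # about 4 bytes per SLB
    symbol_bytes = n * 1                        # one byte per stored symbol
    counter_bytes = rules * counter_bits // 8   # one counter per rule
    return slb_bytes + symbol_bytes + counter_bytes


def contexts_per_memory(memory_bytes: int, n: int, rules: int = 4) -> int:
    """Whole contexts that fit in the application state memory."""
    return memory_bytes // context_bytes(n, rules)


print(context_bytes(16, 4))             # 64 + 16 + 8
print(contexts_per_memory(512, n=16))   # contexts in 512 bytes, 16 states
print(contexts_per_memory(512, n=8))    # contexts in 512 bytes, 8 states
print(contexts_per_memory(1024, n=16))  # contexts in 1 KB, 16 states
```

Under these assumptions a 512-byte memory holds 5 contexts for n=16 and twice as many for n=8. For 1 KB the integer division gives 11, so the figure of about 10 contexts quoted above is a rounded, conservative one.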

There are ‘N’ application state contexts (ASC), 8101, 8102, 8103, stored in the application state memory, 8100. ASC1, 8101, is illustrated to be one of the RE rules for an anti-spam application, whereas ASCn, 8103, is illustrated to be one of the RE rules for an XML application. Each ASC may be further organized into NFA parameters, 8104, and SLB parameters, 8105. The SLB parameters are state specific parameters like Start Flag, Accept Flag, SDV, Action and the like that are applied to the specific SLB. These could simply be a 4-byte memory location. Similarly, the NFA parameters are those that are applicable to the entire NFA, like symbols, number of rules per NFA, substitution table and the like. RB/LB information may be common to the entire NFA or may be stored on a per state basis.

The benefit of ASCs is that when the control processor, global configuration controller or cluster controller, discussed later, needs to apply the rules for a different application than that currently configured in the search engine, it can make the transition to the rules for the new application rapidly by informing the application state control of each search engine. Thus in an array of search engines, which may be used to evaluate a large number of REs in parallel, switching to an application context can be very rapid, unlike the case where the configuration information has to be retrieved from a global memory or external system memory and provided to each search engine. In the example of n=16 states, it may be possible to switch the application context for each search engine within 20 to 25 clocks. If there are 1000 search engines being used in parallel to do 1000 RE rule searches, and if the application based context has to be loaded into each of them in turn from a global memory or external memory, it can potentially take 1000 times as long, or 20,000 to 25,000 clock cycles, which can bring down the performance of the search processor significantly compared to using ASCs per search engine as in this invention.

Even though the above discussion is with respect to different applications, it may be possible to use different REs for the same application in different ASCs. This can allow the search processor to support a larger number of REs for an application, though there would be a performance impact if all the REs have to be applied to the input stream. In such a case, the input stream may be presented multiple times to the search engine, each time switching the ASC, which in effect applies a new RE for the application to the stream. Another usage model of the ASC can be for the same application's nested RE rules. For example, as illustrated in FIG. 75, the RE rules are nested rules, where only a limited set is being applied to the input stream in parallel. In such a case, one rule from each level of nesting could be set up in individual ASCs, and the action programmed could indicate that when the current rule reaches an accept state the context needs to be switched to another ASC. For example, R9, R7 and R4 could each be stored in three independent ASCs, with rule R9 being used as the initial context. Now if the input string matches R9, then the context can be switched to bring in rule R7, which then looks for the ‘FONT’ string. If R7 reaches an accept state, then the context can be switched to R4, and so on. Thus a nested RE rule set for any given application can be easily supported in the search engine of this invention. Thus one can create a runtime adaptable finite state automaton architecture using various runtime configurable and programmable features of the NFA based search engine described above.
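The nested-rule usage model above, where an accept in one ASC triggers a switch to the next, can be sketched as follows. This is an illustration under loud assumptions: each context holds a literal string instead of a full RE, `str.find` stands in for the search engine's evaluation, and the patterns shown for R9 and R4 are placeholders invented for this sketch (only the ‘FONT’ string for R7 comes from the text).

```python
def run_nested(contexts, start, stream):
    """Apply one context at a time, switching context on each accept.

    contexts maps a context name to (literal pattern, next context or None).
    Returns a list of (context name, end offset of the match).
    """
    active, pos, trace = start, 0, []
    while active is not None:
        pattern, next_context = contexts[active]
        idx = stream.find(pattern, pos)  # stand-in for NFA evaluation
        if idx < 0:
            break                        # current rule never reaches accept
        pos = idx + len(pattern)
        trace.append((active, pos - 1))  # end marker of the match
        active = next_context            # programmed action: switch the ASC
    return trace


# Placeholder patterns for R9 and R4; only R7's 'FONT' is from the text.
CONTEXTS = {
    "R9": ("<", "R7"),
    "R7": ("FONT", "R4"),
    "R4": (">", None),
}
print(run_nested(CONTEXTS, "R9", "text <FONT size=2> more"))
```

Only one context is resident at a time in this model, which mirrors how a single search engine can cover a nested rule set by switching contexts instead of holding every rule simultaneously.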

As discussed earlier, a regular expression can be converted to an NFA and a DFA using well known methods. Further, it is also well known that an NFA and a DFA for the same RE are equivalent. However, there are differences in terms of the performance metrics of an NFA vs. a DFA, as discussed earlier. DFAs can result in a large storage requirement for the states in the worst case compared to NFAs. However, in practice this is often not the case, and many times a DFA may have fewer states than an NFA. Thus a content search processor using only NFA based search engines may not realize the most efficient implementation. One can use a combination of NFAs and DFAs to evaluate a set of regular expressions. Further, if the DFA implementation can be more efficient than the NFA, then more rules can be supported on a single integrated search processor compared to one that purely uses NFAs. DFAs may have transitions to multiple states from a state, each transition taken on a different input symbol. Thus to process a state, one would need to detect multiple symbols and, based on the input symbol, choose the appropriate transition. In DFAs only one state is active at a time, by the deterministic nature of DFAs. Thus, it may be possible to select DFAs that are like the one illustrated in FIG. 76 c, where the number of states in the DFA is less than or comparable to that in an NFA and the number of transitions from a state is also limited, in this case to an upper bound of two, and program them for evaluation using a DFA based search engine like that illustrated in FIG. 82. As long as the number of states of the DFA is not more than can be held in the DFA instruction and state flow memory, described below, the DFA could be chosen instead of the NFA for evaluation.

FIG. 82 illustrates a DFA based search engine. The DFA search engine comprises a DFA operations engine, 8204, DFA state controller, 8203, DFA context control, 8200, DFA instruction and state flow memory, 8201, counters, 8202, DFA extension logic, 8208, occurrence counter, 8206, Min & Max counters, 8207, or pattern substitution control, 8205, or a combination of the foregoing to generate a runtime adaptable DFA search engine. The DFA search engine also has the facility to create an extension of the FSA using the DFA extension logic, 8208, which provides functionality similar to the FSA extension logic, 8006, of FIG. 80. Thus multiple DFA blocks can be grouped together to support REs that result in larger DFAs than those supported by one DFA search engine. The DFA operations engine is a small processor that supports executing targeted DFA specific operations. Examples of the type of DFA operations are illustrated in FIG. 83. The DFA state controller, 8203, receives commands from the cluster controller or the global configuration controller to program the instruction memory & state flow memory, 8201, with the instruction sequence, using the DFA operations, that would form the DFA flow. The DFA context controller, 8200, controls the application and the DFA context that is currently being executed. The DFA search engine also supports application state contexts similar to the NFA search engine of FIG. 80, and they are stored in the instruction and state flow memory, 8201. The application context for the DFA is primarily comprised of the instruction sequence for the RE rules for each application. Similar to the NFA search engine discussed above, the DFA search engine can also hold multiple REs for the same application and thereby support RE nesting as described earlier for NFAs. The DFA operations engine, 8204, is essentially a scalar processing engine that supports all DFA operations, like the examples described below. 
The DFA operations engine may be a multi-way superscalar architecture where multiple instructions can be processed simultaneously. The multi-way instructions would essentially be state transition evaluation instructions which would take the DFA to one of multiple next states based on the received input symbol. If the DFA operations engine supports two simultaneous evaluations to support DFAs with at most two transitions per state, then each RE could be converted into a DFA that meets this requirement, or one could compare the cost of implementing an RE as an NFA or a DFA and choose the least costly realization by mapping that RE to the appropriate search engine on the content search processor. The DFA operations engine in general does not retain state information, since the DFA instruction flow holds the knowledge of the state. Further, the DFA operations engine evaluates each incoming symbol through the same engine and hence does not have the hardware overhead of a multiple symbol detection array, unlike the NFA search engine. This causes the DFA search engine to be significantly cheaper to implement compared to an NFA search engine, where the difference in device utilization could be up to 5 times. The DFA context controller, 8200, provides the context and the pointer to the instruction within the context to the instruction memory, which retrieves the instruction vector(s) to be evaluated by the DFA operations engine, 8204. The instruction vector is comprised of the instruction or the operation, the expected symbol(s), next state pointer(s), a default state, or control information or a combination of the foregoing. The instruction vector may contain information for multiple transitions if a multi-way architecture is selected for the DFA search engine. The DFA operations engine executes or evaluates the instruction against the received input symbol, decides which state to transition to next and starts its evaluation. 
If there is no match between the received symbol and the expected DFA operation, the flow transitions to a default state, which may be the start state or an error state. If the DFA operations engine reaches an accept state, then an appropriate match signal is generated on the M-bus by the DFA operations engine along with the tag or the action associated with this accept state. This information is processed by the cluster controller or the global configuration controller, which decides the next course of action to follow for the input. The DFA operations engine has access to the occurrence counter, 8206, min & max counters, 8207, and character counters, 8202, which are used to support various DFA operations. The character counters are used to mark the beginning and end of a string that matches the DFA rule. Pattern substitution control, block 8205, and DFA extension logic, block 8208, provide functionality similar to that discussed for the blocks 8008 and 8006 respectively, illustrated in FIG. 80.

FIG. 83 illustrates example DFA operations. The operations have been defined to model common regular expression operators. The number and type of DFA operations can be different from those illustrated without deviating from the teachings of this patent. For example, an illustrated operation ‘EQ X’ may be used to evaluate if the received symbol is equal to ‘X’. If the evaluation results in an affirmative answer, the transition that requires this evaluation can be taken. The DFA operations illustrated form the instructions that get processed or evaluated by the DFA operations engine, 8204. DFA instructions comprise more fields than those illustrated in FIG. 83. The DFA instruction may comprise a DFA operation, like that illustrated in FIG. 83, expected symbol(s), target instruction address(es) to branch to on successful evaluation of the instruction, an alternate target instruction address if the evaluation is not successful, or other control fields like setting of counters, retrieval of counters, output of an accept state flag, action or tag and the like, or a combination of the foregoing. The number of symbols being evaluated and the target instruction addresses would depend on the number of parallel transitions being supported. When the DFA operations engine is two-way superscalar, the number of symbols would be two and there would be two target addresses associated with them besides the default address. Each instruction of the DFA operation would represent the evaluation of transitions from a state of the DFA. Thus when a branch is taken to a target instruction address, the instruction at the target would typically be the one for the state that the DFA transitions into, unless this is an error state, in which case the DFA may be restarted at the initial or start state. For example, one could code the DFA illustrated in FIG. 76 c as follows using some of the DFA operations illustrated in FIG. 83. 
The example below is not a limiting format for the DFA operations and is used for illustration purposes to better explain the DFA search engine operation.

State/IP  Op1       S1   T1  Op2  S2   T2  Other
A: 0      EQ        ‘x’  1   EQ   ‘y’  2   0, start count
B: 1      EQ        ‘y’  0   EQ   —    —   0
C: 2      EQ        ‘y’  2   EQ   ‘x’  3   0
D: 3      GEN_FLAG  —    —   EQ   ‘y’  0   0, action, tag, flag, end count

The fields of the instructions above are Op1 (first operation), S1 (symbol1), T1 (IP Target1), Op2 (second operation), S2 (symbol2), T2 (IP Target2) and other fields (like default target, init start count, init end count, action, tag and the like). For evaluation of the DFA illustrated in FIG. 76 c, such code may be stored in the DFA instruction and state flow memory, from where each instruction is retrieved and processed. The column with the header “State/IP” represents the corresponding state of the DFA and the instruction pointer (IP) value for the corresponding instruction. There may be preamble code, beyond that illustrated, that may need to be executed before the DFA code to set up appropriate counters, error state processing and the like. However, in the example above the start state ‘A’ and its IP are used as the default target if the input symbols are not those expected by the DFA. The DFA evaluation starts in the start state at IP=0. The instruction at IP=0 is first fetched and evaluated against the input symbol. If the input symbol is a ‘y’, then the DFA is expected to transition from state ‘A’, 7612, to state ‘C’, 7614; however, if the received symbol is ‘x’, then the transition is to state ‘B’, 7613. Assuming that the input symbol is a ‘y’, the evaluation of the first instruction results in an affirmation of the comparison of symbol2 (S2) with ‘y’, and hence the associated target IP of 2 is selected as the next instruction to fetch and evaluate against the next symbol. IP=2 corresponds to state ‘C’, which is the transition defined by the DFA of FIG. 76 c. The DFA search engine now processes the instruction at IP=2 and then follows the appropriate flow as defined by the instructions. When the execution reaches IP=3, which corresponds to the accept state ‘D’, 7615, the DFA search engine outputs a match flag and the associated tag or action to indicate that a string matching the current DFA has been found. 
The DFA engine also records the end count to indicate where the string match ends in the input stream. A DFA engine with multi-way execution like the one illustrated above may also assign priority among the multiple operations being performed simultaneously, where the first operation's affirmative result may get a higher priority over the other operations. In general, since DFAs are deterministic, only one of the listed operations of an instruction should produce an affirmative result, though such a priority mechanism may be created to avoid execution ambiguity.
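The instruction walk-through above can be reproduced with a small table-driven interpreter. This is a sketch under stated assumptions rather than the engine's actual encoding: the `Instr` fields, the rule that the start count is latched when the flow leaves IP=0, and the reporting of a match on entry into the accept instruction are modeling choices made here. Op1 is evaluated before Op2 to mirror the priority scheme described.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Instr:
    op1: str
    s1: Optional[str]
    t1: Optional[int]
    op2: str
    s2: Optional[str]
    t2: Optional[int]
    default: int = 0      # default target: start state 'A' at IP=0
    accept: bool = False  # instruction belongs to an accept state


# The four instructions of the example, indexed by IP.
PROGRAM = [
    Instr("EQ", "x", 1, "EQ", "y", 2),                          # A: IP=0
    Instr("EQ", "y", 0, "EQ", None, None),                      # B: IP=1
    Instr("EQ", "y", 2, "EQ", "x", 3),                          # C: IP=2
    Instr("GEN_FLAG", None, None, "EQ", "y", 0, accept=True),   # D: IP=3
]


def next_ip(instr: Instr, ch: str) -> int:
    """Evaluate Op1 first, then Op2; fall back to the default target."""
    for op, sym, target in ((instr.op1, instr.s1, instr.t1),
                            (instr.op2, instr.s2, instr.t2)):
        if op == "EQ" and sym is not None and target is not None and ch == sym:
            return target
    return instr.default


def run(program, stream):
    """Return (start count, end count) for every entry into an accept state."""
    ip, start, matches = 0, None, []
    for pos, ch in enumerate(stream):
        prev = ip
        ip = next_ip(program[ip], ch)
        if prev == 0 and ip != 0:
            start = pos                   # assumed start count behavior
        if program[ip].accept:
            matches.append((start, pos))  # match flag with end count
    return matches


print(run(PROGRAM, "xyyx"))
```

For the input ‘xyyx’ the flow is A to B on ‘x’, back to A on ‘y’, A to C on ‘y’ and C to D on ‘x’, so a single match is reported when IP=3 is reached.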

Though the example above used only two types of operations, there are many other operations that the DFA operations engine can perform, as illustrated in FIG. 83. There may be a RANGE operation defined which enables one to search for a symbol ‘S’ that falls within the range of ASCII codes for two symbols, with ‘X’ indicating the lower bound of the range and ‘Y’ indicating the upper bound of the range. ‘X’ and ‘Y’ in this case may be any 8-bit or 16-bit extended ASCII codes. Similarly, a set of operations is defined that enables the DFA code to loop in a given state until a certain condition is met or until a certain number of occurrences of a certain symbol has been seen. There are other ways to realize these operations which would be within the scope of this invention, as may be obvious to one skilled in the art.
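The RANGE operation described above amounts to a comparison of symbol codes against two bounds. A minimal sketch follows; the function name `op_range` is an assumption, and the bounds are taken to be inclusive, which the text does not state explicitly.

```python
def op_range(symbol: str, lower: str, upper: str) -> bool:
    """True when the code of `symbol` lies between the codes of the bounds."""
    return ord(lower) <= ord(symbol) <= ord(upper)  # inclusive bounds assumed


print(op_range("m", "a", "z"))  # lowercase letter inside a..z
print(op_range("A", "a", "z"))  # uppercase letter outside a..z
```

A single RANGE evaluation replaces what would otherwise be one EQ comparison per symbol in the range.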

The DFA based search engine of FIG. 82 is essentially an application specific processor which operates on targeted instructions like those illustrated in FIG. 83. The DFA search engine operates by having the DFA context control, 8200, fetch the instructions from the DFA instruction and state flow memory, 8201, which are then evaluated by the DFA operations engine, 8204, by performing the operation indicated by the instruction. This may typically involve comparing the input symbol received on ‘InChar’ with those expected by the instruction and then deciding the next instruction to fetch based on the result of the evaluation. The DFA operations engine, 8204, provides the information about the next instruction to the DFA state and context control blocks 8203 and 8200, which in turn provide the next instruction address to the DFA instruction and state flow memory, 8201. The performance of this processor can reach the same level as or higher than that of state of the art microprocessors using advanced process technologies and thereby achieve line rate performance of 1 Gbps to 10 Gbps and higher. Thus one can create a runtime adaptable finite state automaton architecture using various runtime configurable and programmable features of the DFA based search engine described above.

A DFA search engine as illustrated in FIG. 82 may be implemented in 50 to 200 logic elements in today's FPGAs, depending on the supported DFA operations and multi-way execution support. The instruction and flow state memory may be 512 bytes per search engine or less, depending on the number of application contexts that get supported and the number of states that need to be supported. This can allow over a thousand DFAs to be included in a single FPGA, which can have close to 200 thousand logic elements. Thus a DFA and NFA based content search processor may be implemented in today's best of class FPGAs, supporting one to two thousand REs in one FPGA, which may be operated at the clock rate achievable in the FPGA. A higher number of REs can be supported in an FPGA format when multiple application rules are taken into account, which can drive the number into multiple thousands of rules. This invention thus supports creating content search solutions using FPGAs besides ASICs, ASSPs and other forms of integrated circuits, which may be useful when the time to market is important and the volumes are not high enough to justify the expense of creating ASICs, ASSPs or other ICs. When the NFA and DFA search engines of this invention are implemented on best of class process technologies like 90 nm or 65 nm that are used in today's leading edge microprocessors, the performance of these engines can be taken to the same or higher frequencies and thus can enable very high line rate processing of regular expressions, unlike today's microprocessors. Unlike the below 100 Mbps performance of a 4 GHz processor evaluating a few hundred NFAs or DFAs as described earlier, a thousand RE content search processor with the NFA and DFA search engines of this invention could achieve clock rates similar to the leading microprocessors and evaluate around one input symbol per clock, which may be a single character or more characters, thereby achieving well above 10 Gbps performance. 
Further, with device densities growing with small geometry process technologies like 90 nm, 65 nm and lower, the number of REs that can be incorporated on an IC can be in the multiple thousands, depending on the choice of the mix of NFAs and DFAs. By using a mix of NFAs and DFAs on the content search processor, it would be possible to create a series of products that offer different numbers of both in a chip; for example, one product can offer 25% DFAs and 75% NFAs, while another could offer 75% DFAs and 25% NFAs. Thus it may become possible to meet varying market needs. If an application requires more REs than those provided on a single content search processor, multiple processors can be used in parallel to increase the number of REs. Thus, it is possible to use this invention to support search applications requiring thousands of REs without the growth in memory usage that would be needed by a composite DFA and that makes it uneconomical for general use.

FIG. 84 illustrates a runtime adaptable search processor. The processorcomprises a control processor and scheduler, 8402, a coprocessor/hostinterface controller, 8401, data/packet buffers, 8403, adaptationcontroller, 8404, a runtime adaptable search processor array, 8405,configuration and global memory, 8406, or memory controller, 8407, or acombination of the foregoing. The runtime adaptable search processor isalso referred as a search processor or adaptable search processor orcontent search processor or runtime adaptable content search processorin this invention. Similarly, runtime adaptable search processor arrayis also referred to as the search processor array in this invention. Thesearch processor may be embedded in multiple types of systems in varyingconfigurations similar to those in FIG. 68, FIG. 69 and FIG. 70, bysubstituting the security processor of these figures with the searchprocessor. Search solutions derived from the processors of thisinvention comprise line cards which may incorporate the searchprocessor, security processor, SAN protocol processor, TCP/IP processoror runtime adaptable protocol processor or various other processorsdisclosed in this patent or a combination of the foregoing. The linecard configuration and the architecture may vary with the specificsystem and the application. Three types of line card architectures, a)flow-through b) look-aside and c) accelerator card, are illustrated inthis patent to illustrate usage models for the processors of thispatent. FIG. 68, FIG. 69 and FIG. 70 illustrate these configurationsusing a security processor based system, though it could also be basedon the search processor or other processors of this patent as discussedearlier. Blocks 6612 and block 6613 illustrate two of these types ofcard configurations. 
The security processor illustrated in these cardsmay be substituted with the search processor disclosed in this patent.There are various different variations of the search processor that canbe created depending on the functionality incorporated in the processor.Blocks 6614 and block 6615 illustrate two versions of a securityprocessor which could also be used to illustrate search processors byreplacing the security processor with search processor. Block 6614illustrates a security processor comprising at least a content searchand rule processing engine coupled with a runtime adaptable processor. Asimilar search processor can be created by substituting the runtimeadaptable processor with a runtime adaptable search processor array,8405, as disclosed in this application. This processor is similar tothat illustrated in FIG. 71 with the same substitution and is describedin detail above within the context of a security processor. Block 6615,illustrates the security processor of block 6614 coupled with a TCP/IPprocessor or a protocol processor to provide more functionality usablein a security node as a security processor. Similarly, a searchprocessor may also be coupled with a TCP/IP processor or a protocolprocessor to provide more functionality usable in a content searchprocessor node that may be used in a network as a network element or aclient node or a server node. Such a processor may be used to enableapplication aware networking capabilities to achieve intelligentnetworking. 
The choice of the search processor may depend on the system in which it is being deployed, the functionality supported by the system, the solution cost, performance requirement, or other reasons, or a combination thereof. The search processor may use one or more ports to connect to external memories, block 6616, which may be used to store additional rules beyond those stored on the search processor, or other intermediate data or packets or other information as necessary to perform various functions needed for search processing. The memories may be of various types like DRAM, SDRAM, DDR DRAM, DDR II DRAM, RLDRAM, SRAM, RDRAM, FCRAM, QDR SRAM, DDR SRAM, Magnetic memories, Flash or a combination thereof or future derivatives of such memory technologies. The inventions disclosed in this patent enable many variations of the architectures illustrated, and it may be appreciated by those skilled in the art that changes in the embodiments may be made without departing from the principles and spirit of the invention.

The search processor of FIG. 84 may be used as a coprocessor in a systemwhich needs acceleration for content search of local content, remotecontent or content received in network packets or the like. Theconfiguration in which this processor may be used is similar to thatillustrated in FIG. 69 and FIG. 70 where the security processor of theblocks 6906 and 7005 may be replaced with the search processor of FIG.84 or other variations of the search processor as described above. Thesearch processor may be incorporated on an add-in card which may act asan acceleration card in a system as illustrated in FIG. 89, where thesearch processor, 8904, may be that of FIG. 84 or other variations ofthe search processor as described above. The search processor may alsobe configured in a form illustrated by FIG. 71 where the runtimeadaptable processor, block 7102, may be substituted with runtimeadaptable search processor array, block 8405. Such search processor canhave the benefit of allowing the number of rules the processor cansupport to grow beyond the number supported by the runtime adaptablesearch processor array. Thus if this search processor array is able tosupport a couple thousand REs, then if the user need to grow the numberof REs beyond that, they may be supported as a composite DFA that may beimplemented on the content search/rule processing engines, 7106. Thusthe user may be able to grow the number REs upto a point that theexternal memories can support without needing to add another searchprocessor. The content search processor of this application may also beused as a security processor in applications that require deep packetinspection or require screening of the application content likeanti-spam, anti-virus, XML security, web filtering firewalls, intrusiondetection/prevention and the like. Thus all figures in this applicationthat refer to security processor may also incorporate a searchprocessor.

The search processor receives the content it needs to search from themaster processor in a coprocessor configuration like that of FIG. 69,FIG. 70, FIG. 71, FIG. 73 or FIG. 89 or it may receive the content fromthe network when coupled with the network interfaces like those in FIG.68 or FIG. 72. The control processor and scheduler, 8402, deposits dataor information content to be searched into data/packet buffers, 8403,for scheduling them into the search processor array when the processorarray is ready to receive it. In general the search processor can meetthe network line rates from below 100 Mbps, 1 Gbps to 10 Gbps andhigher, and hence the packets may not need to be buffered. However, itmay be necessary to buffer the packets if the packets or content comesinto the search processor at a rate higher than what it can handle. Itmay also be necessary to schedule the packets that belong to the sameflow (as in the same transport layer connection for instance) to beprocessed by the search processor in the correct sequential order. Thepackets of such flows may also get stored in the data/packet buffersuntil packets in the flow before them get processed. The searchprocessor may also need to provide inspection ability across multiplepackets in a connection between a source and a destination. The controlprocessor and scheduler, block 8402, provides such functionality aswell. The control processor may store the internal processing state ofthe runtime adaptable search processor array in a connection databasewhich may be maintained in the configuration and the global memory,8406, or in the off-chip memory. The control processor and schedulerlooks up the execution or analysis state for a given connection when apacket corresponding to the connection is presented to it by the masterprocessor or the incoming traffic in a flow-through configuration. 
The connection ID may be created by the master processor and provided to the search processor with the packet to be inspected, or the search processor may derive the connection association from the header of the packet. The connection ID may be created in the IP protocol case by using a 5-tuple hashing derived from the source address, destination address, source port, destination port and the protocol type. Once the connection ID is created and resolved in case of a hash conflict by the control processor and scheduler, it then retrieves the state associated with that connection and provides the state to the search processor array, block 8405, to start searching from the state of the connection. This mechanism is used to create multi-packet searches per connection and detect any search strings or security violations or threats that span packet boundaries. For example, if there is a rule defined to search for “Million US Dollars” and if this string appears in a connection data transfer in two separate packets where “Million U” appears in one packet and “S Dollars” appears in another packet, then if a connection based multi-packet search mechanism of this patent is not present the security violation may not be detected since each packet individually does not match the rule. However, when the multi-packet search is performed, no matter how far apart in time these two packets arrive at the search processor, the state of the search will be maintained from one packet to another for the connection and the strings of the two packets will be detected and flagged as a continuous string “Million US Dollars”.
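The connection based multi-packet search described above can be sketched in software as follows. This is only a minimal illustration, not the hardware mechanism of the search processor array: the hash function (SHA-256 truncated to 16 bits), the use of a Knuth-Morris-Pratt automaton for the single literal rule, and all class and function names are assumptions made for the example. Hash conflicts are resolved here by keeping the full 5-tuple alongside the hashed connection ID.

```python
import hashlib

PATTERN = b"Million US Dollars"  # the example rule from the text


def connection_id(five_tuple, table_bits=16):
    """Hash (src addr, dst addr, src port, dst port, protocol) to an ID.

    The hash function and ID width are assumptions for illustration.
    """
    key = "|".join(str(field) for field in five_tuple).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") & ((1 << table_bits) - 1)


def build_failure_table(pattern):
    """KMP failure function: longest proper prefix that is also a suffix."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


FAIL = build_failure_table(PATTERN)


def search_packet(payload, state=0):
    """Resume the search from a saved state; return (matched, new_state)."""
    matched = False
    for byte in payload:
        while state and byte != PATTERN[state]:
            state = FAIL[state - 1]
        if byte == PATTERN[state]:
            state += 1
        if state == len(PATTERN):
            matched = True
            state = FAIL[state - 1]
    return matched, state


class ConnectionTable:
    """Per-connection search state, keyed by hashed ID then full 5-tuple."""

    def __init__(self):
        self._buckets = {}

    def load(self, five_tuple):
        bucket = self._buckets.get(connection_id(five_tuple), {})
        return bucket.get(five_tuple, 0)

    def store(self, five_tuple, state):
        bucket = self._buckets.setdefault(connection_id(five_tuple), {})
        bucket[five_tuple] = state


def inspect(table, five_tuple, payload):
    """Search one packet, continuing from the state saved for its flow."""
    state = table.load(five_tuple)
    matched, state = search_packet(payload, state)
    table.store(five_tuple, state)
    return matched
```

With this sketch, inspecting a packet ending in “Million U” and later a packet of the same 5-tuple starting with “S Dollars” reports a match on the second packet, whereas searching either payload from the initial state alone does not.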

The control processor and scheduler schedules the packets or data to the runtime adaptable search processor array in the appropriate order. The runtime adaptable search processor array, block 8405, adaptation controller, block 8404 and configuration and global memory, block 8406 are similar to those illustrated in FIGS. 62, 63, 64 and 65. The runtime adaptable search processor array and the associated blocks provide similar functionality with appropriate logic enhancements made to couple to the control processor and scheduler of the search processor. The runtime adaptable search processor array and its components are described below. The search processor array is presented with each character of the incoming packet or data, which it then examines for string match with the RE rules programmed in it. When a string is matched the search processor array provides that indication along with other information like the RE rule identification, the action to be taken, the tag associated with this rule, the start count of the match, the end count of the match and the like to the control processor and scheduler, which may then take appropriate action depending on the rule or rules that have been matched. The controllers inside the search processor array may also take appropriate action(s) as required as discussed below.

FIG. 97 illustrates a search compiler flow which is used for full and incremental rules distribution. The search compiler of FIG. 97 allows an IT manager to create search and security rules of different types as discussed above in the discussion of FIG. 67 and enables them to create a layered and/or pervasive security model. The compiler flow would be provided with the characteristics of the specific nodes like the security capability presence, the rules communication method, the size of the rule base supported, the performance metrics of the node, deployment location e.g. LAN or SAN or WAN or other, or the like. The compiler flow then uses this knowledge to compile node specific rules from the rule set(s) created by the IT manager. The compiler comprises a rules parser, block 9704, for parsing the rules to be presented to the FSA Compiler Flow, block 9706, which analyzes the rules and creates a rules database used for analyzing the content. The rule parser may read the rules from files of rules or directly from the command line or a combination depending on the output of the rule engines. The rules for a specific node are parsed to recognize the language specific tokens used to describe the rules or regular expression tokens, and the parser outputs regular expression rules, 9705. The parser then presents the REs to the FSA compiler flow, described below in detail for FIG. 96. The FSA compiler flow processes the incoming REs and generates an NFA and a DFA for each RE. It then decides whether the RE will be processed as a DFA or an NFA based on the cost of using one vs. the other and also the search processor capability for the node. It then generates the NFA or DFA rule in a format loadable into the search processor and stores it in the compiled rules database storage, 9708.

Rules distribution engine, block 9709, follows the central manager and rules distribution flow illustrated in FIG. 57 and FIG. 58 with appropriate changes to communicate to the right configuration of the search processor. The search rules may be distributed to the host processor or a control plane processor as illustrated in FIG. 58 or to the control processor and scheduler, block 8402, or a combination thereof as appropriate depending on the node capability. The rules may be distributed using a secure link or insecure link using proprietary or standard protocols as appropriate per the specific node's capability over a network.

The control processor and scheduler, block 8402, communicates with therules distribution engine, block 9709 to receive appropriate compiledrule tables prior to starting the content inspection. It programs thereceived rules into the appropriate NFA or DFA search engines, describedearlier, working with the adaptation controller, 8404. There may bemultiple rules being stored in each search engine dependent on thenumber of application contexts supported by the search processor. Oncethe rules distribution engine provides the compiled rules to the controlprocessor and scheduler and they are setup in their respective engines,the search processor is ready to start processing the data stream toperform content inspection. The search processor state configurationinformation is received via the coprocessor/host interface controller ora media interface controller, not illustrated. The search processor ofthis patent may be deployed in various configurations like a look-asideconfiguration illustrated in FIG. 69 or flow-through configurationillustrated in FIG. 68 or an accelerator adapter configurationillustrated in FIG. 70 as well others not illustrated which can beappreciated by persons skilled in the art. In a look-aside or anaccelerator adapter configuration, the search processor of this patentis under control of a master processor which may be a network processoror a switch processor or a TCP/IP processor or classification processoror forwarding processor or a host processor or the like depending on thesystem in which such a card would reside. The control processor andscheduler receives the configuration information under the control ofsuch master processor that communicates with the rule engine to receivepackets that contain the configuration information and passes it on tothe search processor. 
Once the configuration is done, the masterprocessor provides packets or data files or content to the searchprocessor for which content inspection needs to be performed using thecoprocessor or host interface. The coprocessor or the host interface maybe standard buses like PCI, PCI-X, PCI express, RapidIO, HyperTransportor LA-1 or SRAM memory interface or the like or a proprietary bus. Thebandwidth on the bus should be sufficient to keep the content searchengine operating at its peak line rate. The search processor may be amemory mapped or an IO mapped device in the master processor space forit to receive the content and other configuration information in alook-aside or accelerator configuration. The search processor may bepolled by the master processor or may provide a doorbell or interruptmechanism to the master to indicate when it is done with a given packetor content or when it finds a match to the programmed rules. The controlprocessor and scheduler, block 8402 and the interface controller, block8401 work with the master processor to provide the above functionality.

The control processor and scheduler stores incoming packets to the packet buffer, block 8403, and schedules the packets for processing by the search processor array, block 8405, as they become available to analyze the content. The scheduler maintains the record of the packets being processed by the specific engines and once the packets are processed it informs the master processor. The search processor array informs the control processor and the scheduler when it has found a match to a rule and the action associated with that rule. This information may in turn be sent by the control processor to the master processor, where the master processor can take specific action indicated by the rule for the packet. The actions may be one from a multitude of actions like dropping the packet or dropping a connection or informing the IT manager, or the like, as discussed earlier.

FIG. 85 illustrates runtime adaptable search processor array. Theruntime adaptable search processor array is similar to the runtimeadaptable processor illustrated in FIG. 64 and discussed above. Thefunctionality of various blocks of the RAP and search processor array issimilar. The search processor array illustrates that each computecluster, 6401 (1) through 6401 (Z), in FIG. 64 may be either a computecluster or a search cluster or a combination, as illustrated by blocks8501 (1) through 8501 (Z). If the cluster is a compute cluster, then itsfunctionality is the same as that discussed in FIG. 64 and FIG. 65,however if the cluster is a search cluster than its functionality isdifferent as illustrated in FIG. 85 and discussed below. The runtimeadaptable search processor array could be used instead of runtimeadaptable processor in FIG. 62 and FIG. 63 and offer the runtimeadaptable search processing and runtime adaptable compute processingcapability to the adaptable TCP/IP processor for capabilities discussedabove as well as deep packet search, application layer search, upperlayer (layers 4-7) security and the like that can utilize search for alarge number of rules simultaneously. The global memory and globalmemory controller, 8502, provides functions similar to those describedfor block 6402 above, and may also interact with the global andconfiguration memory, 8406, of the runtime adaptable search processorillustrated in FIG. 84.

FIG. 86 illustrates an example search cluster. The search cluster issimilar to the compute cluster illustrated in FIG. 65, however, insteadof compute elements, the search cluster comprises of search engines. Asearch engine may be DFA search engine, illustrated in FIG. 82 or a NFAsearch engine, illustrated in FIG. 80. The search cluster may compriseof a combination of DFA based search engines and NFA based searchengines, or it may comprise of only one type of search engines, eitherDFA based or NFA based. The selection may be made at the time ofimplementation of the search processor for various design optimizationslike keep a regular structure within a search cluster which may resultin efficient area utilization, power grids, clock network and the like.It would be obvious to those skilled in the art that variousconfigurations of the search clusters are feasible from the teachings ofthis patent and it is not feasible to describe them all. The searchcluster may comprise of search engines, cluster routing switch, 8605,cluster memory and flow controller, 8606, or cluster configurationcontroller, 8607 or a combination of the foregoing. The block clustermemory and flow controller, 8606, comprises a cluster memory, a clustermemory controller and a cluster flow controller. The search processormay comprise some clusters that may be DFA based, while others may beNFA based or a combination of DFA and NFA based search clusters. Thus itis possible to create a spectrum of products that may offer varyingnumbers of DFA based and NFA based clusters as described above. Thesearch engines, 8601(1) through 8601(Z) may be interconnected with eachother as necessary to create RE rules that may be larger than thosesupported by single search engine (SE). It may also be possible tocreate groups of SEs that all are part of a RE rule group like thoseillustrated in FIG. 75. 
SEs that belong to a group may be enabled or disabled by the controllers, 8707 and 8709, depending on the results of the search by the currently enabled group of SEs. The content that needs to be searched for the rules configured in the search engines is presented to the search engines by the search processor array, FIG. 85, through the cluster configuration controller, 8607, the cluster routing switch, 8605, and the inter-cluster routing switch, 8503. The results of the search may also be carried through a similar route from the search clusters to the control processor and scheduler, 8402. The search results are also communicated to the cluster memory and flow controller, 8606, and the cluster configuration controller, 8607, which may be configured to take actions as required by the rules that get matched. These controllers may inform other clusters of the actions that need to be taken such that all the search clusters are synchronized with each other in terms of the rule groups that are active as well as the current application context that may need to be active for each cluster.

The search engines in a search cluster are configured by the clusterconfiguration controller, 8607, which interacts with the globalconfiguration controller, 8505. The global configuration controller isused for global configuration control and interacts with the adaptationcontroller, 8404, control processor and scheduler, 8402, configurationand global memory, 8406 and the memory controller, 8407 to receive,store and retrieve various hardware configuration information asrequired by the rules being configured by the user or an IT managerusing the search compiler flow illustrated in FIG. 97. The searchcompiler compiles the rules as per the capabilities of a specific searchprocessor of search device and then interacts with the control processorand scheduler, 8402, through a host or master processor or otherwise asdescribed earlier, to configure the search processor array with theserules. These rules are communicated by the control processor,interacting with various blocks listed above, to the clusterconfiguration controller. The cluster configuration controller thenconfigures the application state memory, 8001, for an NFA based searchengine and DFA Instruction and State flow memory, 8201, for a DFA basedsearch engine. Once all the search engines in all the clusters areconfigured with appropriate rules and the application contexts thecontent search may start. As described earlier there may be a need forthe search processor to maintain the information of flows or sessions tocontinue searching of content inside packets of the same flow that mayarrive at the search processor at different times. Similarly insolutions that require searching of messages or files or other contentwhich may be sent to the search processor by a host processor, there maybe a need to maintain the context if the content is sent in chunks ofbytes or pages or segments or the like. 
The flow context may be maintained in the global memory, 8406, of the search processor or may be stored in the memory coupled to the memory controller, 8407. When the number of search engines is large, the amount of information that may need to be stored for a given context can cause a performance issue. The DFA and NFA search engines of this patent may need to save a minimum of one 32-bit word per context. Hence, for example, if the number of search engines is 1024, distributed amongst 64 clusters with 16 search engines each, then one would need to store and retrieve up to 1024 32-bit words from global memory. This can be a significant performance barrier for the search processor that may need to support 1 Gbps to 10 Gbps line rate. Hence, for applications that need high line rate performance, this invention enables the clusters to locally store the flow context information in the cluster memory, 8606. For the current example, if the search engines are organized in a 4×4 array, and there are four ports into the local memory, then on a per cluster basis four 32-bit words of flow context need to be stored and retrieved through each port for the flow context switch. Since all the clusters can access their local memories in parallel, the flow context switch can thus be accomplished in 8 clock cycles compared to 2048 clock cycles when the flow information has to be stored and retrieved from the global memory. The loading of a new context would require 4 clock cycles, which would be the minimum time needed to switch the context where the storing of the context being swapped out could be done in background while the search of the new context begins. Thus, this invention can solve a major performance bottleneck that may exist in architectures that require the context to be stored in the global memory.
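The cycle counts in the preceding example follow from a simple calculation, sketched below. The sketch assumes that each memory port transfers one 32-bit word per clock cycle and that a full context switch consists of storing the outgoing context and then loading the incoming one; these assumptions and the function name are illustrative, chosen because they reproduce the figures given in the text.

```python
def context_switch_cycles(engines, words_per_engine=1, ports=1):
    """Clock cycles to swap flow context through a memory with `ports` ports.

    Assumes one 32-bit word per port per clock cycle (an assumption).
    """
    words = engines * words_per_engine
    load = (words + ports - 1) // ports  # ceiling division
    store = load                         # outgoing context is the same size
    return {"load": load, "store": store, "total": load + store}


if __name__ == "__main__":
    # Global memory: all 1024 engines share a single port.
    print(context_switch_cycles(engines=1024, ports=1))
    # Cluster memory: 16 engines per cluster, four ports, clusters in parallel.
    print(context_switch_cycles(engines=16, ports=4))
```

Because the 64 clusters access their local memories in parallel, the per-cluster figure is also the figure for the whole array: 8 cycles for a full swap, or 4 cycles if the store of the outgoing context is overlapped with the start of the new search.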

FIG. 87 illustrates intra-search cluster rule groups and switchingexample. As discussed earlier, regular expression rules may be groupedtogether as illustrated in FIG. 75. The search processor of thisinvention allows the grouping of search engines to enable the rulegroups of regular expressions. FIG. 87 illustrates two such rule groupswhich may be created as per the application need and the applicationcontext. Rule group 1, 8701, is illustrated to comprise of three searchengines, while Rule group 2, 8702, is illustrated to comprise at least 8search engines. The number of search engines in a rule group may varybased on the application context and a rule group may go across one ormore search clusters. Thus a very flexible search processor can becreated by this invention. The illustration primarily shows the rulegroups confined to a search cluster. The figure illustrates an exampleof switching of rule groups. Assuming search engine group 1 is enabledto search for content first and the rule group 2 is disabled. Once thecontent search begins, if one of the rules in rule group 1 finds amatch, it can inform the cluster memory and flow controller, 8707 andthe cluster configuration controller, 8709, of the action that may needto be taken as a result of the match as illustrated by sending a messageS1, 8703 to the memory controller which in turn may send a similarmessage Sn, 8711, to the configuration controller, 8709. The message Snis used to illustrate multiple interactions between the controllers toavoid complicating the figure for each step of the message flow. Thecontrollers in turn send a message S2, 8704, to the rule group 1 todeactivate the search engines in the group and send message S3, 8705 &8706, to rule group 2 to activate the search engines in the rule group2. Once the rule group switching occurs, the newly activated rule group2 starts inspecting the content for its rules, while rule group 1 stopsinspecting the content. 
Thus nested rule groups may be created using this invention. The cluster configuration controller, 8709, may interact with the global configuration controller, 8505, to make a switch to another rule group that may reside in another search cluster(s) or to make a switch to a different application context across the search processor as may be required by the search result.
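The rule group switching of FIG. 87 can be modeled in software as a small sketch. The class names, the use of Python's `re` module in place of hardware search engines, and the example patterns are all assumptions for illustration; the comments map the steps to the messages S1 through S3 described above.

```python
import re


class RuleGroup:
    """A group of search engines, modeled here as compiled regexes."""

    def __init__(self, name, patterns, enabled=False):
        self.name = name
        self.rules = [re.compile(p) for p in patterns]
        self.enabled = enabled

    def search(self, content):
        """Return the patterns that match; a disabled group never matches."""
        if not self.enabled:
            return []
        return [r.pattern for r in self.rules if r.search(content)]


class ClusterController:
    """Stands in for the cluster memory/flow and configuration controllers."""

    def __init__(self, groups, switch_on_match):
        self.groups = {g.name: g for g in groups}
        self.switch_on_match = switch_on_match  # {matching group: next group}

    def inspect(self, content):
        hits = {}
        # Snapshot the active groups so that a group activated by this
        # content only starts inspecting with the next content presented.
        active = [g for g in self.groups.values() if g.enabled]
        for group in active:
            matched = group.search(content)
            if not matched:
                continue
            hits[group.name] = matched              # S1: match reported
            target = self.switch_on_match.get(group.name)
            if target is not None:
                group.enabled = False               # S2: deactivate group
                self.groups[target].enabled = True  # S3: activate next group
        return hits


if __name__ == "__main__":
    group1 = RuleGroup("group1", [r"^EHLO", r"^HELO", r"^MAIL FROM"], enabled=True)
    group2 = RuleGroup("group2", [r"free offer", r"wire transfer"])
    ctrl = ClusterController([group1, group2], {"group1": "group2"})
    print(ctrl.inspect("EHLO mail.example.org"))
    print(ctrl.inspect("limited free offer inside"))
```

Chaining further entries in the switch table gives the nested rule groups mentioned above; a switch to a group in another cluster would go through the global configuration controller rather than this per-cluster object.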

FIG. 88 illustrates example rules configuration and context switch. Thefigure illustrates the runtime adaptable search processor array of FIG.85, with different type of rules configured in each of the searchclusters. The blocks 8801 (1) through 8801 (Z) are the search clusterssimilar to the blocks 8501 (1) through 8501 (Z). The figure illustratestwo contexts for each cluster. The search engines in search cluster 8801(1) is configured with HTTP and SMTP rules beside any others not shown,whereas 8801 (L) is configured with HTML and SMTP. The label “I:<AppName1>” is meant to indicate that the cluster has <AppName1>application configured to be the Initial configuration. The label “S:<AppName2>” is meant to indicate that the cluster switches to AppName2under the control of the global configuration controller. There are someclusters marked with label “I: WAIT”, which is meant to illustrate thatthose clusters are not active during the Initial phase of the search.However, those clusters may switch to the application context shown withthe “S: <appName>” label. When the search processor is configured likethe illustration in FIG. 88 with the initial set of clusters beingactive, then those clusters initially inspect the incoming content. Inthis example each cluster is searching for a particular type ofapplication content being received, which may be done in a networkedsystem for instance. Assuming that the incoming packet belongs to anSMTP application, then the search engines in search cluster, 8801 (3),detect the SMTP packet and inform the cluster controller, which theninteracts with the global configuration controller by sending a message,8817, which indicates that SMTP content is found and hence otherclusters should apply rules for SMTP which may be anti-spam rules orparsing the SMTP content into its components or the like. 
The global configuration controller then sends the messages, 8818, 8819, 8820, to all the clusters for them to switch their context to SMTP if they have that context configured and start applying those rules to the content being received. Thus, the search processor array can be adapted at runtime to perform complex search operations. Though the examples of FIG. 87 and FIG. 88 are fairly simple to keep the description simple, there can be a lot more complex scenarios that can be created using the runtime adaptable search processor as may be appreciated by those skilled in the art.
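The context switch of FIG. 88 can be illustrated with the following sketch, in which a detecting cluster notifies a global controller that then asks every cluster to switch. The class names, cluster labels and the particular mix of configured contexts are hypothetical; `None` as the initial context models the “I: WAIT” label.

```python
class SearchCluster:
    """A search cluster with an initial context and switchable contexts."""

    def __init__(self, name, initial, switchable=()):
        self.name = name
        self.active = initial  # None models a cluster labeled "I: WAIT"
        self.configured = {c for c in (initial, *switchable) if c is not None}

    def switch_to(self, context):
        """Switch only if this cluster has the requested context configured."""
        if context in self.configured:
            self.active = context
            return True
        return False


class GlobalConfigController:
    """Stands in for the global configuration controller, 8505."""

    def __init__(self, clusters):
        self.clusters = clusters

    def broadcast(self, context):
        """Ask every cluster to switch; return the names of those that did."""
        return [c.name for c in self.clusters if c.switch_to(context)]


if __name__ == "__main__":
    clusters = [
        SearchCluster("cluster1", "HTTP", ["SMTP"]),  # I: HTTP, S: SMTP
        SearchCluster("cluster2", None, ["SMTP"]),    # I: WAIT, S: SMTP
        SearchCluster("cluster3", "SMTP"),            # detects SMTP content
        SearchCluster("clusterL", "HTML", ["SMTP"]),  # I: HTML, S: SMTP
        SearchCluster("clusterX", "FTP"),             # no SMTP context
    ]
    controller = GlobalConfigController(clusters)
    # cluster3 reports SMTP content; the controller notifies all clusters.
    print(controller.broadcast("SMTP"))
    print({c.name: c.active for c in clusters})
```

A cluster without the SMTP context configured keeps its current context, matching the condition in the text that clusters switch only "if they have that context configured".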

FIG. 89 illustrates a computing device with a content search accelerator. The computing device may be a server, a workstation, a personal computer, a networking device like a switch or a router or other type of computing device. This is one type of configuration in which a content search accelerator using one version of the content search processor of this invention may be used. The figure illustrates a computing device comprising one or more CPUs, 8900 (1) through 8900 (n), at least one chipset, 8902, at least one memory component, 8901, with at least one content search accelerator, 8903, and zero or more adapters providing other functions. The content search accelerator may comprise a search processor, 8904. It may also comprise at least one memory component, 8905, coupled to the search processor. There are many different system configurations that may be created with the content search processor of this invention. Hence the examples in this patent should not be used as limiting the scope to them; rather they are primarily a means to explain the search processor in a few sample usage scenarios. The search processor of this patent may be used on line cards, network adapters or network interface controllers, storage networking cards, IO cards, motherboards, control processing cards, switching cards or other system elements of systems like networking devices such as routers, switches, management devices, security devices, gateways, virtualization devices, storage networking devices, servers, storage arrays, and the like. The search processor or its components may also be coupled to the microprocessors, network processors, switching chips, protocol processors, TCP/IP processors, control plane processors, chipsets, control processors or other devices, including being incorporated as a functional block on these processors or chips. 
The content search processor may be used to perform content inspection at high line rates in the systems in which it is incorporated to offload or assist in content processing to the main processors of such systems. There may be configurations where multiple search processors may also be incorporated in systems to provide scaling in performance or number of rules or a combination thereof for content search. The search processor may be incorporated on network line cards, in line with the traffic and offer line rate deep packet inspection as discussed earlier.

The configuration illustrated in FIG. 89 may be used for email security or instant message security or outbound security or extrusion detection or HIPAA compliance or Sarbanes-Oxley compliance or Gramm-Leach-Bliley compliance or web security or the like or a combination thereof. The security capabilities listed may comprise anti-spam, anti-virus, anti-phishing, anti-spyware, detection/prevention of directory harvest attacks, detection/prevention of worms, intrusion detection/prevention, firewalls, or the like or detection/prevention of leaks of confidential information, health care information, customer information, credit card numbers, social security numbers or the like or a combination thereof. The content search processor or processors in such a device may be configured with a set of security rules for one or more of the applications listed above and provide acceleration for content search for information incoming or outgoing from the device. Since the content search processor may be used inline with the traffic, the device may be deployed at any place in the network, like close to a router or a switch or gateway of an organization's networks or at a departmental level or within a datacenter or a combination and provide high speed content inspection to incoming or outgoing traffic flow of the network.

FIG. 90 illustrates example anti-spam performance bottleneck and solution. As discussed earlier, content search performance using a DFA or NFA based search on a microprocessor results in below 100 Mbps performance. FIG. 90 illustrates an anti-spam application as an example application to show the value of hardware based content search. The performance numbers are not illustrated to scale. The figure illustrates four vertical stacks of operations in four types of appliances. The first stack, 9000, is illustrated to represent an email appliance stack. An email appliance typically may comprise device drivers to drive the hardware devices on the appliance, the networking protocol stack along with other functions of the Operating System (OS) and a mail transport agent (MTA) which are all typically software components along with other application software. Today's servers, which are typically used for email appliances, are able to keep up with network line rates of up to 1 Gbps, and perform the application functions due to the high performance processors. Typically a 1 GHz processor is required to process 1 Gbps line rate traffic for network protocol stack processing. Since the state of the art processors are around 4 GHz today, the servers can handle the network traffic and have processing power available to do other needs of the OS and the applications running on a server. Thus the email appliance stack, 9000, running on a high end server, should be able to keep up with a high line rate. A study by Network World magazine, “Spam in the Wild: Sequel” done in December 2004, showed the performance comparison of a large number of anti-spam software and appliance vendors. Under their configuration the range of the message processing performance of the vendor products listed was from around 5 messages per second to 21 messages per second. 
When this performance number is translated into line rate performance using the worst case message sizes used by Network World of 10,000 characters per message, the line rate performance comes to be below 2 Mbps sustained performance. All the vendors' solutions, whether software or appliance, were based on dual Xeon processor servers. Thus, the performance of a server that can handle 1 Gbps network line rate traffic drops to below 10 Mbps when it runs an anti-spam application. The reason for this is that one of the features used extensively by most anti-spam vendors is searching of emails against a set of rules, which are typically represented as regular expressions. The anti-spam appliance stack, 9001, illustrates the email appliance with anti-spam capability loaded on it. Anti-spam applications typically perform many complex regular expression rules based filtering along with statistical filtering, reputation based filtering and the like. The anti-spam rules are typically applied sequentially to each incoming email one after the other to find a rule that may match the content of the email. Then the anti-spam application may apply scores to the rules that match and then decide if a message is spam or not based on the total score it receives. Such an operation causes the stack performance needs to grow substantially higher than a typical email appliance stack, where the overhead of the anti-spam filters, 9005, on the performance of the appliance is substantial enough to reduce the overall anti-spam server appliance performance to below 10 Mbps. The content search processor of this invention can be used in such anti-spam appliances to achieve significant performance improvements. The hardware accelerated anti-spam appliance stack, 9002, illustrates the impact of using the search processor of this invention on the overall performance of the system. 
In such a case, all the anti-spam filters, 9011 through 9013, may be configured on the search processor, 9006, which in turn may be used to inspect each incoming message. Because all rules would be searched simultaneously and the host CPU is relieved from the performance intensive regular expression searches, the search processor based appliance can achieve 1 Gbps line rate performance or more. The compute device illustrated in FIG. 89 may be one such configuration that may be used as the anti-spam appliance to achieve multiple orders of magnitude higher performance than a standard server based anti-spam appliance. The stack, 9003, illustrates a stack of an enhanced messaging appliance which may use the TCP/IP offload processor for offloading the protocol processing from the host CPU along with the content search processor of this invention. Thus a significant amount of CPU bandwidth can be made available to other applications which may not have been possible to execute on the computing device without significant performance impact. The use of TCP/IP offload and content search processor may be done individually or in combination and the use of one does not require the use of the other. TCP offload and content search processors could be on the same device providing network connectivity and the acceleration.
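The sub-2 Mbps figure derived above from the Network World numbers can be checked with simple arithmetic. The following is a minimal sketch, assuming one character occupies 8 bits and ignoring protocol overhead; the function name is illustrative and not part of the disclosure.

```python
def line_rate_mbps(messages_per_second, chars_per_message, bits_per_char=8):
    """Convert message throughput to a payload line rate in Mbps.

    Assumption: each character is one 8-bit byte and protocol
    overhead is ignored.
    """
    bits_per_second = messages_per_second * chars_per_message * bits_per_char
    return bits_per_second / 1_000_000


# Range reported in the study: about 5 to 21 messages per second,
# with worst case messages of 10,000 characters.
slowest = line_rate_mbps(5, 10_000)   # 0.4 Mbps
fastest = line_rate_mbps(21, 10_000)  # 1.68 Mbps
print(f"{slowest:.2f} Mbps to {fastest:.2f} Mbps")
```

Even the fastest product in the quoted range therefore stays under 2 Mbps of sustained payload, which is roughly three orders of magnitude below a 1 Gbps line rate.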

FIG. 91 illustrates an anti-spam with anti-virus performance bottleneck. This figure is very similar to FIG. 90, except that the anti-spam appliance whose stack is illustrated also supports anti-virus capability. Anti-virus searches are different from the anti-spam searches but they also add a significant performance overhead as illustrated by the stack, 9104. The number of filters for anti-virus is a lot larger than those for anti-spam, though when a content search processor(s) of this invention is used the anti-virus overhead can also be substantially reduced as illustrated by 9105.

FIG. 92 illustrates application content search performance bottleneck and solution. The content search processor of this invention can be used as a search accelerator for a large number of applications that require content search but do the search on the host processor. Since the performance of the host processors for content search is not very high as discussed above, a content search processor based accelerator can substantially increase the performance of these applications. The applications that require content search are many, like data warehousing applications, database applications, bioinformatics related applications, genetics, proteomics, drug discovery related applications and the like. The figure illustrates three boxes, 9200, 9201 and 9202, which represent the content search based application performance in terms of host CPU load. The traditional applications run on a server or a workstation or personal computer, and perform content search interspersed with other tasks that the application needs to perform. If the applications perform a significant amount of search, then the performance need of the search portions of the application can be substantially higher than the other parts. This is illustrated by content search portions of applications app1 and appN, 9203 and 9205 respectively, compared to the other code of these applications, 9204 and 9206. The stack in 9200 is how the current or prior art solution exists for content search applications. Though the stack illustrates a continuous stack for content search and other code sections, the actual execution may generally be composed of search interspersed with other code functions. When a content search processor and accelerator of this invention is used in the computing device performing this function, it may be possible to have the application leverage the search capabilities of the processor and accelerate the performance of the application substantially compared to a computing device without the search acceleration support. 
The stack in 9201 illustrates the impact on the CPU load and the resulting time spent by the application when converted to leverage the content search accelerator. The stacks 9203 and 9205 could take substantially less load and time as illustrated by stacks 9207 and 9208 respectively. Similarly, the performance of the system may be further increased by offloading the TCP/IP protocol processing as illustrated by 9209. As described above, TCP/IP offload and content search offload are independent of each other and may each be done without the other in a system. However, one could also use the content search processor with the TCP/IP processor together as separate components or on the same device and achieve the performance benefits.

FIG. 93 illustrates an example content search API usage model. As discussed above, the content search processor may be used to accelerate content search portions of generic applications. To ease the creation of new applications and the migration of existing applications to leverage such search processor acceleration capability one may create an application programming interface (API) for content search. An example content search API is illustrated in FIG. 94 and described below. The content search API may reside in the user level or the kernel level with user level calls, or a combination. FIG. 93 illustrates the content search API at the user layer, 9307. The content search API would provide API functions that any application can call to get the benefit of content search acceleration. There would be a convention of usage for the applications to use the content search API. For example the application may be required to set up the search rules that can be configured on the search processor using the API calls before the application is run or may be required to dynamically create the rules and set them up in the appropriate format so that they can be configured on the content search processor using the API or a combination. There would be API calling conventions that may be established dependent on the hardware system, the operating system or the search processor or a combination. The applications may then be coded to the API conventions and benefit from the search processor acceleration. The figure illustrates applications App1, 9300 through AppN, 9303, working with the content search API, 9307, to get access to the content search acceleration hardware, 9317, using logical interface paths illustrated as 9312, 9313 and 9314. 
The content search API may access the services and resources provided by the content search accelerator through a port driver which may be running under a kernel. The applications may pass the content to be searched directly through this interface or put the content to be searched as well as tables to be set up as needed, in the application's buffers, 9304, 9305, and then instruct the content search processor to retrieve the information from these buffers through the content search API. The API may map these buffers to the kernel space so the port driver for the search API can provide them to the content search processor or the buffers may be made available for direct memory access by the search processor hardware. The search processor may store the content in on-chip or off-chip memory buffers, 9318, and then perform the requested search on the content. Once the search is complete the results of the search may be provided back to the application using a doorbell mechanism or a callback mechanism or data buffers or the like as allowed by the operating system's model. The content search API may provide a polling mechanism as well which may be used by the application to check and/or retrieve the search results.

FIG. 94 illustrates an example content search API with example functions. The figure illustrates a set of functions which may be a part of the example content search API. Though the list of functions may be more or less than those illustrated, the functions provide a basic set that would enable an application to use the content search hardware with the use of the API. The example functions do not illustrate the input, output or return parameters for API function calls, which may depend on the operating system, calling conventions and the like. An application may use the API by first querying the capabilities of the hardware engine and then initializing it with appropriate rules, pointers, permissions and the like that may be required for the content search processor to communicate with the application and its resources through the kernel or the user mode or a combination. The application may set specific rules as DFA rules or NFA rules which may get configured in the search processor. An application may be given access to multiple contexts that it may be able to leverage to perform context based search. The application can start performing the search against its content once the content search processor is appropriately set up with all necessary rules. The application can communicate the content to be searched directly to the search processor using the API by sending a byte stream through the interface. There may be versions of an API function, not illustrated, like sendData( ) which may be used by an application to start sending data to the search processor, start the search and to indicate when the search processor should stop searching. A more efficient way of performing the search may be that the application may fill a buffer or a set of buffers to be searched, and then provide a pointer(s) to the buffer(s) to the search processor, which can then start searching the buffers with the configured rules once it receives a call to start the search using an API call like startHWsearch( ). 
The search processor may have been initialized to communicate the results of the search to the application through one of many mechanisms like copying the results to a result buffer or storing the result on the memory associated with the search processor or invoking a callback function registered by the application to the operating system or the like. The search processor may also communicate to the application with a doorbell mechanism to inform it that the search is done. There are many different ways of communicating the information as described earlier and these may be dependent on the operating system and the system hardware architecture. There may also be a polling mechanism available with an API function like isSearchDone( ), not illustrated, which may provide the answer to a query to the search hardware whether a specific search is complete. If the answer from the hardware to the application is that the search is done, then the application may ask for the specific result using an API call like getRes( ), or the application may ask for a pointer to a buffer that may hold the result using a call like getResPtr( ) illustrated in FIG. 94. Once the application is done with the specific search or is done using the search processor it may call the API function stopHWsearch( ) to stop the hardware processor from performing the search for this application. 
There may also be an API call like removeAppContext( ), not illustrated, which may be called by the application to indicate to the OS and the search processor hardware that the application is done using the search processor and hence all its associated context may be freed up by the search processor hardware for use by another application that may need the search processor resources. There may be other hardware feature specific API calls as well, like setRuleGroup( ), selectRuleGroup( ), setInitGroup( ) and the like, that may allow an application to create groups of rules and the order of their execution using mechanisms of rule grouping described earlier. As discussed earlier there may be many more functions and variations of API functions that can be created to enable a general content search application acceleration using a hardware search processor from the teachings of this patent that will be obvious to one skilled in the art. Thus it is possible to create a content search API to provide content search capabilities to general applications. Though the description above is given with an example where the rules to be used are set up by an application before starting the search, it may be possible to update the rule set that is configured in the search processor dynamically while the search is in progress by adding, removing and/or modifying the rules that have already been configured to start using the updated rule set for any future searches by the application.
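The call sequence described above may be sketched against a software stand-in for the hardware. This is an illustration only: the class MockSearchProcessor, the function names getCapabilities( ) and setRule( ), the capability values and the result format are assumptions, since the description does not give parameters or return values; startHWsearch( ), isSearchDone( ), getRes( ), stopHWsearch( ) and removeAppContext( ) follow the names used in the text. Python's re module takes the place of the NFA and DFA engines.

```python
import re


class MockSearchProcessor:
    """Software stand-in for the search hardware (illustration only)."""

    def __init__(self):
        self.rules = {}       # rule id -> compiled regular expression
        self.buffers = []     # content buffers handed over by the application
        self.results = None   # None until a search has completed
        self.running = False

    def getCapabilities(self):            # hypothetical name and values
        return {"max_rules": 1024, "fsa_types": ("NFA", "DFA")}

    def setRule(self, rule_id, pattern):  # hypothetical name
        self.rules[rule_id] = re.compile(pattern)

    def startHWsearch(self, buffers):
        # Real hardware would search asynchronously; the mock searches at once.
        self.buffers = list(buffers)
        self.running = True
        self.results = [
            (rule_id, index, match.start())
            for index, data in enumerate(self.buffers)
            for rule_id, rule in sorted(self.rules.items())
            for match in rule.finditer(data)
        ]

    def isSearchDone(self):
        return self.results is not None

    def getRes(self):
        return self.results

    def stopHWsearch(self):
        self.running = False

    def removeAppContext(self):
        # Free everything held on behalf of this application.
        self.rules.clear()
        self.buffers = []
        self.results = None


hw = MockSearchProcessor()
assert "DFA" in hw.getCapabilities()["fsa_types"]   # query capabilities first
hw.setRule("r1", r"free\s+money")                   # configure rules
hw.setRule("r2", r"\bviagra\b")
hw.startHWsearch(["get free  money now", "hello world"])
while not hw.isSearchDone():                        # polling mechanism
    pass
matches = hw.getRes()      # (rule id, buffer index, offset) tuples
hw.stopHWsearch()
hw.removeAppContext()
print(matches)
```

A callback or doorbell based variant would replace the polling loop with a function registered before the search is started; which of these is available would depend on the operating system and hardware, as noted above.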

FIG. 95 illustrates an example application flow (static setup) using the search processor. The flow illustrates a static process for setting up the rules and the hardware processor although as discussed above a dynamic setup is also feasible as would be obvious to one skilled in the art. The flow may allow an application to add/remove/modify rules in the processor as the application executes at runtime to enable a dynamic flow. The illustration provides a mechanism where existing applications or new applications may be updated with content search rules and API calls which can enable the application to use a content search processor. An application source, 9500, may be updated, 9501, to create application source with modifications for content search where the content search rules may be set up in distinct code sections or may be clearly marked, 9502, as expected by the compiler coding conventions, which is then compiled by a content search aware compiler, 9503. The compiler generates an object code, 9504, with content search rules compiled in sections which a loader may use to configure them in the search processor. The application object code may then be distributed to customers or users of content search processor for accelerating the application's search performance. The application code may be distributed electronically using the internet, world wide web, enterprise network, or other network or using other means like a CD, DVD, or other computer storage that can be used to load the application. The application update, 9501, may be done manually or using a tool or both as appropriate. The distributed object code, 9506, is read by the loader, 9507, or a similar function provided by an operating system to which the application is targeted, and set up for execution on the system. 
The loader or another function may use a set of API calls or a port driver or other OS function or a combination to configure the content search processor with appropriate rules that the application needs as coded in the object code as illustrated by block 9508. Once the search application hardware is set up and other resources that the application needs get reserved or set up, the application is started, 9509, by the OS. The application may execute or perform tasks, 9510, if needed before content search. The application may then set up the content, 9511, that it needs searched by the search processor. Then it starts the search processor to perform search, 9513. Once the search is done it may retrieve the results, 9514. While the search is being conducted by the search processor, the application may continue to perform other tasks, 9512, on the main CPU or other elements of the system. If the application is done the application may exit, otherwise the application may continue the execution where more tasks may be performed including new search if necessary. The flow diagram illustrates the execution of tasks as a loop from 9515 to 9510, though the tasks being executed may be very different from one time to the next through the loop. The loop is not illustrated to mean that the same code sequence is being repeated. It is meant to show that the type of tasks may be repeated. Further, not all tasks from 9510 through 9515 may need to be present in an application flow. Once the application is done, it may release all the resources it uses besides those for the content search processor.
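The runtime part of the flow above, in which the application continues with other tasks while the search is in progress, may be sketched as follows. This is a minimal sketch under stated assumptions: a single worker thread stands in for the search processor, Python's re module stands in for the configured rules, and the function names, the rule set and the "other work" are hypothetical. The block numbers in the comments map each step to FIG. 95.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Rules configured before the application starts (block 9508).
RULES = [re.compile(p) for p in (r"error \d+", r"timeout")]


def search(content):
    """Stand-in for the search processor: returns (rule index, offset) pairs."""
    return [
        (i, m.start())
        for i, rule in enumerate(RULES)
        for m in rule.finditer(content)
    ]


def run_application(chunks):
    all_results = []
    other_work = 0
    with ThreadPoolExecutor(max_workers=1) as processor:
        for content in chunks:                           # loop from 9515 back to 9510
            pending = processor.submit(search, content)  # set up content, start search (9511, 9513)
            other_work += len(content.split())           # other tasks on the main CPU (9512)
            all_results.append(pending.result())         # retrieve the results (9514)
    return all_results, other_work


results, work_done = run_application(["error 42 then timeout", "all good"])
print(results, work_done)
```

The word count is only a placeholder for whatever the application does while the search runs; the point of the sketch is that the search is started, other work proceeds, and the results are collected afterwards.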

FIG. 96 illustrates FSA compiler flow. The content search rules in a regular expression form are used as an input to the FSA compiler flow. The FSA compiler searches through all the rules or groups of rules, and parses them, 9602. The compiler then generates an NFA and a DFA to meet the content search processor's implemented architecture as illustrated by steps 9603 and 9604. The compiler then compares the cost of implementing the RE as an NFA and a DFA and selects the NFA or DFA according to the processor capabilities as illustrated by the steps 9606 and 9611. The compiler generates the NFA or DFA code possibly with action and/or tag as needed by the RE or RE group. The compiler repeats the steps 9602 through 9607 until all the RE rules have been processed. It then generates all the rule FSA code and tags as necessary for the entire RE rule set as illustrated by 9608 and stores the rules in a rules database in a storage, 9609, which may be RAM, ROM, hard disk, floppy, CD, DVD or any other computer storage. The compiled rules may then be distributed to the content search processors as illustrated by the step 9610. The rules may be distributed by a distribution engine like that described above for the security compiler flow.
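The NFA versus DFA selection step may be illustrated with a small sketch. The description does not define the cost function, so the sketch assumes a simple one: the DFA is chosen when the number of DFA states produced by the standard subset construction fits a hypothetical per-engine state budget (max_dfa_states), and the NFA is chosen otherwise. The NFA is written by hand for the expression (a|b)*a(a|b)(a|b); all names are illustrative.

```python
from collections import deque


def subset_construction(nfa, start, alphabet):
    """Return the set of reachable DFA states (each a frozenset of NFA states)."""
    start_state = frozenset([start])
    seen = {start_state}
    queue = deque([start_state])
    while queue:
        current = queue.popleft()
        for symbol in alphabet:
            target = frozenset(
                nxt for state in current for nxt in nfa.get((state, symbol), ())
            )
            if target and target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def select_fsa(nfa, nfa_states, start, alphabet, max_dfa_states):
    """Pick DFA when it fits the (hypothetical) DFA engine budget, else NFA."""
    dfa_states = subset_construction(nfa, start, alphabet)
    choice = "DFA" if len(dfa_states) <= max_dfa_states else "NFA"
    return choice, nfa_states, len(dfa_states)


# NFA without epsilon moves for (a|b)*a(a|b)(a|b):
# the third symbol from the end of the input is 'a'. State 3 accepts.
NFA = {
    (0, "a"): {0, 1}, (0, "b"): {0},
    (1, "a"): {2}, (1, "b"): {2},
    (2, "a"): {3}, (2, "b"): {3},
}

roomy = select_fsa(NFA, 4, 0, "ab", max_dfa_states=16)
tight = select_fsa(NFA, 4, 0, "ab", max_dfa_states=4)
print(roomy)  # ('DFA', 4, 8)
print(tight)  # ('NFA', 4, 8)
```

The example expression is a common illustration of DFA state growth: four NFA states become eight DFA states, and each additional trailing (a|b) doubles the DFA size while adding only one NFA state. A real compiler would likely weigh further factors such as memory layout and symbol width.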

FIG. 97 illustrates a search compiler flow (full and incremental rule distribution). The flow can be used for distributing search rules or security rules when the full set of rules is defined or when any updates or modifications are made to the rule set and incremental changes to the rule set need to be communicated and configured in the search processor. The flow illustrated in FIG. 97 is very similar to that illustrated in FIG. 67 except that the FSA compiler flow described above is a part of the flow in FIG. 97. The rules like application layer rules, network layer rules or storage network layer rules or any other search rules may be created using manual or automated means and provided as inputs to the search compiler flow in a predefined format. The search compiler's rule parser, 9704, parses the rules and converts them into regular expression format if the rules are not already in that form. Then the regular expression rules are passed through the FSA compiler flow which is illustrated in FIG. 96. The output of the FSA compiler is the rules compiled to the node capabilities of the node that has the content search processor and stored in the rules database. The rules from the rule database are retrieved and distributed by the rules distribution engine to the appropriate node(s) with the search processor. The rules distribution engine, block 9709, distributes the rules to the appropriate nodes using a central manager flow and rules distribution flow similar to that illustrated in FIG. 57 and FIG. 58. The search or security rules may be distributed to the host processor or a control plane processor as illustrated in FIG. 58 or to a control processor and scheduler, block 8402, described above, or a combination thereof as appropriate depending on the node capability. The rules may be distributed using a secure link or insecure link using proprietary or standard protocols as appropriate per the specific node's capability over a network. 
The network may be a local area network (LAN), wide area network (WAN), internet, metro area network (MAN), wireless LAN, storage area network (SAN) or a system area network or another network type deployed or a combination thereof. The network may be Ethernet based, internet protocol based or SONET based or other protocol based or a combination thereof.
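For the incremental case, the set of changes to be communicated to a node may be derived by comparing the rule set already deployed on that node with the updated rule set. The sketch below assumes rules are held as a mapping from a rule identifier to a regular expression string; this representation, the function name rule_delta and the sample rules are hypothetical and only illustrate the idea of sending additions, removals and modifications rather than the full set.

```python
def rule_delta(old_rules, new_rules):
    """Compare two rule sets keyed by rule id and return the incremental changes."""
    added = {rid: new_rules[rid] for rid in new_rules.keys() - old_rules.keys()}
    removed = sorted(old_rules.keys() - new_rules.keys())
    modified = {
        rid: new_rules[rid]
        for rid in old_rules.keys() & new_rules.keys()
        if old_rules[rid] != new_rules[rid]
    }
    return {"add": added, "remove": removed, "modify": modified}


deployed = {"r1": r"free\s+money", "r2": r"\d{16}", "r3": r"password"}
updated = {"r1": r"free\s+money", "r2": r"\d{13,16}", "r4": r"ssn"}

delta = rule_delta(deployed, updated)
print(delta)
```

Only the rules in the delta would then need to be recompiled by the FSA compiler flow and distributed; unchanged rules such as r1 stay configured as they are.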

FIG. 98 illustrates FSA synthesis and compiler flow for FPGA. The flow illustrated in this figure is similar to the FSA compiler flow illustrated in FIG. 96. The steps 9801 through 9811 are essentially the same as the steps 9601 through 9611. The FSA synthesis and compiler flow for the FPGA illustrates a few additional steps which can be used to convert the FSA rules to synthesize a content search processor for them in an FPGA. The RE rules converted to NFA or DFA are used to generate appropriate RTL in an automated way as illustrated by 9812. The NFA or DFA RTL generation may use NFA and DFA building block RTL from a library as illustrated by 9817. Once the RTL for all elements of the search processor array is generated from all the RE rules, the top level RTL for the search processor may be generated with NFA or DFA specific search engines targeted for the compiled rules. The output RTL is then merged with any other RTL code for the search processor, for example a memory controller, and the output may be stored in computer storage. This merged RTL is then provided as an input to the FPGA RTL synthesis tools which generate an FPGA based search processor specifically targeted to meet the needs of the regular expression rules that are input to the FSA synthesis and compiler flow.

The processors of this invention may be manufactured into hardware products in the chosen embodiment of various possible embodiments using a manufacturing process, without limitation, broadly outlined below. The processor may be designed and verified at various levels of chip design abstractions like RTL level, circuit/schematic/gate level, layout level etc. for functionality, timing and other design and manufacturability constraints for specific target manufacturing process technology. The processor design at the appropriate physical/layout level may be used to create mask sets to be used for manufacturing the chip in the target process technology. The mask sets are then used to build the processor chip through the steps used for the selected process technology. The processor chip then may go through testing/packaging process as appropriate to assure the quality of the manufactured processor product.

While the foregoing has been with reference to particular embodiments of the invention, it will be appreciated by those skilled in the art that changes in these embodiments may be made without departing from the principles and spirit of the invention.

1. A system for operating on network packets or on local content, said system capable of being coupled to a network, said system comprising a runtime adaptable search processor to process content embedded within said network packets or within said local content.
 2. The system of claim 1 wherein said runtime adaptable search processor further provides runtime adaptable search processor array hardware to dynamically select hardware configurations within said runtime adaptable search processor to support application or service processing needs of said network packets or said local content.
 3. The system of claim 1 comprising a content search API to enable applications to utilize the runtime adaptable search processor hardware to accelerate content search; wherein the content search API comprises one or more of the following: means to initialize said runtime adaptable search processor; means to communicate content search rules to said runtime adaptable search processor; means to communicate the content to be searched by said runtime adaptable search processor; means to communicate the results of said content search by said runtime adaptable search processor to said applications; and means to start and stop said runtime adaptable search processor to start and stop content search.
 4. The system of claim 1 wherein said runtime adaptable search processor enables one or more of the following: email security applications comprising one or more of the following: anti-spam, anti-virus, anti-spyware, anti-phishing, confidential leak prevention, and regulatory compliance; and web security applications comprising one or more of the following: firewall, intrusion detection and prevention, application layer security, anti-spam, anti-virus, anti-spyware, anti-phishing, confidential leak prevention, and regulatory compliance, wherein said applications can be performed at network line rates, wherein said network line rates lie within a range of line rates from below 100 Mbps to 10 Gbps and higher.
 5. The system of claim 1 wherein said runtime adaptable search processor receives data packets from a network and sends data packets to a network, and further provides capability to select one or more of the following: a. policy based services, applications, or policies, or a combination thereof, on one or more of the following: the data packets received from said network, and the data packets sent to said network; b. user selected services, applications, or policies, or a combination thereof, on one or more of the following: the data packets received from said network, and the data packets sent to said network; c. user defined services, applications, or policies, or a combination thereof, on one or more of the following: the data packets received from said network, and the data packets sent to said network; and d. user configured services, applications, or policies, or a combination thereof, on one or more of the following: the data packets received from said network, and the data packets sent to said network.
 6. An apparatus comprising a runtime adaptable finite state automaton (FSA) architecture for implementing at least one regular expression, said FSA architecture having a plurality of states, said FSA architecture comprising: a. at least one interconnection between any two of said plurality of states; b. a state dependent vector (SDV) for each of said plurality of states that is used to enable or disable said at least one interconnection between said two states and other states of said plurality of states; c. a current state vector (CSV) for each of said plurality of states representing the current state of some of said plurality of states which the state of said plurality of states depends on for its next state evaluation; d. at least one symbol associated with each of said plurality of states for controlling the transition into or out of said each of said plurality of states; e. at least one symbol detection logic for receiving symbols, detecting the value of said symbols and generating a received symbol vector (RSV); f. at least one application state memory for storing a plurality of application state contexts that use said finite state automaton, each of said plurality of application state contexts corresponding to a regular expression; and g. a state transition logic for each of said plurality of states that uses at least said state dependent vector, said current state vector, and said received symbol vector to generate the next state for each of said plurality of states.
 7. The apparatus of claim 6, wherein said FSA architecture further comprises one or more of the following: a. a programmable storage for said SDV for enabling runtime configuration of said SDV; b. a programmable storage for a left-biased (LB) or right-biased (RB) signal for enabling runtime configuration of the realized FSA type; c. a programmable storage for storing a start flag to indicate if said state is a start state of said FSA; d. a programmable storage for storing an accept flag to indicate if said state is an accept state of said FSA; e. a programmable storage for storing an action field to indicate the action that should be taken when said state is activated or reached during said FSA evaluation; and f. a programmable storage for storing a tag field to indicate a tag that is issued when said state is activated or reached during said FSA evaluation.